80 datasets found

d
Replication Data for: realdata
search.dataone.org
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xu, Ningning (2023). Replication Data for: realdata [Dataset]. http://doi.org/10.7910/DVN/AFZZVP
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/AFZZVP
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Xu, Ningning
Description
(1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and codes to clean the realdata sets and obtain the annotation databases, which are save as .RData files in sudfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1. (3) data_info.R: code to obtain Table 2 and Table 3. (4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The runing time of rejections_perdataset.R is 7 hours around, we thus save the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (h). (6) iterationandtime_plot.R: code for generating Figure 5 based on “Al-Mutawa” data. The code is really time-consuming, nearly 5 days, we thus save the corresponding results and plot them in the main manuscript by pgfplot.
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zip
Updated Dec 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Vaccine Trials Networkhttp://www.hvtn.org/
HIV Prevention Trials Network
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.
f
Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Explore at:
pptxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset updated
Sep 19, 2016
Dataset provided by
figshare
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate

replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

d
Data release for solar-sensor angle analysis subset associated with the...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Data release for solar-sensor angle analysis subset associated with the journal article "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States" [Dataset]. https://catalog.data.gov/dataset/data-release-for-solar-sensor-angle-analysis-subset-associated-with-the-journal-article-so
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Western United States, United States
Description
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE javascript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (eg, .folders Rproj.user, and packrat, and files .RData, and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the r-processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation phenopix package documentation, and description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
g
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
Explore at:
Unique identifier
https://doi.org/10.3886/E105403V3
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
MW3-Dataset
figshare.com
application/x-gzip
Updated Aug 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shuang Zhang (2023). MW3-Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.22154066.v3
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22154066.v3
Dataset updated
Aug 14, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Shuang Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset generated by our Microwell-seq 3.0 technique.

Files: In order to save space, we've packaged our data into tar.gz format. Please unzip the files once you've successfully downloaded. RNA_WT_RData.tar.gz: Seurat object along with a metadata including cell barcodes, tissue source & cell type annotation, could be loaded into R environment and used directly. RNA_Tumor_RData.tar.gz: Seurat object along with a metadata including cell barcodes, tissue source, cell type annotation & potential cell state prediction(neoplastic, intermediate & non-neoplastic), could be loaded into R environment and used directly. RNA_WT_Dge.tar.gz: Digital Expression data (in .csv format) generated by Drop-seq tools, with batch effect removed by customed scripts. RNA_Tumor_Dge.tar.gz : Digital Expression data(in .csv format) generated by Drop-seq tools, with batch effect removed by customed scripts. ATAC_WT_SparseMatrix.tar.gz: scATAC-seq data in 10X-like format(matrix.mtx, barcodes.csv, features.csv), along with a metadata including cell barcodes, tissue source & cell type annotation. ATAC_Tumor_SparseMatrix.tar.gz: scATAC-seq data in 10X-like format(matrix.mtx, barcodes.csv, features.csv), along with a metadata including cell barcodes, tissue source, cell type annotation & potential cell state prediction(neoplastic, intermediate & non-neoplastic).
H
Area Resource File (ARF)
dataverse.harvard.edu
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8NMSFV
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the area resource file (arf) with r the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health services and resources administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an as cii file. this new github repository contains two scripts: 2011-2012 arf - download.R download the zipped area resource file directly onto your local computer load the entire table into a temporary sql database save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta). 2011-2012 arf - analysis examples.R limit the arf to the variables necessary for your analysis sum up a few county-level statistics merge the arf onto other data sets, using both fips and ssa county codes create a sweet county-level map click here to view these two scripts for mo re detail about the area resource file (arf), visit: the arf home page the hrsa data warehouse notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data. confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
d
R script to create boxplots of change factors by NOAA Atlas 14 station, or...
catalog.data.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). R script to create boxplots of change factors by NOAA Atlas 14 station, or for all stations in an ArcHydro Enhanced Database (AHED) basin or county (create_boxplot.R) [Dataset]. https://catalog.data.gov/dataset/r-script-to-create-boxplots-of-change-factors-by-noaa-atlas-14-station-or-for-all-stations
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
The South Florida Water Management District (SFWMD) and the U.S. Geological Survey have developed projected future change factors for precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida. The change factors were computed as the ratio of projected future to historical extreme precipitation depths fitted to extreme precipitation data from various downscaled climate datasets using a constrained maximum likelihood (CML) approach. The change factors correspond to the period 2050-2089 (centered in the year 2070) as compared to the 1966-2005 historical period. An R script (create_boxplot.R) is provided which generates boxplots of change factors for a NOAA Atlas 14 station, or for all NOAA Atlas 14 stations in an ArcHydro Enhanced Database (AHED) basin or county for durations of interest (1, 3, and 7 days, or combinations thereof) and return periods of interest (5, 10, 25, 50, 100, and 200 years, or combinations thereof). The user also has the option of requesting that the script save the raw change factor data used to generate the boxplots, as well as the processed quantile and outlier data shown in the figure. The script allows the user to modify the percentiles used in generating the boxplots. A Microsoft Word file documenting code usage and available options is also provided within this data release (Documentation_R_script_create_boxplot.docx). As described in the documentation, the R script relies on some of the Microsoft Excel spreadsheets published as part of this data release.
d
Replication Data for \"News from the Other Side: How Topic Relevance Limits...
dataone.org
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mummolo, Jonathan (2023). Replication Data for \"News from the Other Side: How Topic Relevance Limits the Prevalence of Partisan Selective Exposure\" [Dataset]. https://dataone.org/datasets/sha256%3A34be755b77da5d25c00a13f766d5b0acdcadf930618153c9fc1d83ae17c5ee9b
Explore at:
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Mummolo, Jonathan
Time period covered
Jun 11, 2014 - Jun 16, 2014
Description
Included are survey data sets and .R script files necessary to replicate all tables and figures. Tables will display in the R console. Figures will save as .pdf files ot your working directory. Instructions for Replication: These materials will allow for replication in R. You can download data files in .R or .tab format. Save all files in a common folder (directory). Open the .R script file named “jop_replication_dataverse2.R” and change the working directory at the top of the script to the directory where you saved the replication materials. Execute the code in this script file to generate all tables and figures displayed in the manuscript. The script is annotated. Take care to execute the appropriate lines when loading data sets depending on whether you downloaded the data in .R or .tab format (the script is written to accommodate both formats). Note: the files "results.diff_rep.Rdata" and "results.diff2.Rdata" are R list objects and can only be opened in R. Should you encounter any problems or have any questions, please contact the author at jmummolo@stanford.edu.
d
Health and Retirement Study (HRS)
search.dataone.org
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ELEKOY
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the health and retirement study (hrs) with r the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death d o us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking arou nd on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked. this new github repository contains five scripts: 1992 - 2010 download HRS microdata.R loop through every year and every file, download, then unzip everything in one big party impor t longitudinal RAND contributed files.R create a SQLite database (.db) on the local disk load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram) longitudinal RAND - analysis examples.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create tw o database-backed complex sample survey object, using a taylor-series linearization design perform a mountain of analysis examples with wave weights from two different points in the panel import example HRS file.R load a fixed-width file using only the sas importation script directly into ram with < a href="http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html">SAScii parse through the IF block at the bottom of the sas importation script, blank out a number of variables save the file as an R data file (.rda) for fast loading later replicate 2002 regression.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create a database-backed complex sample survey object, using a taylor-series linearization design exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document . click here to view these five scripts for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage rand's hrs homepage the hrs wikipedia page a running list of publications using hrs notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you c an think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
o
Data and code for "Plastic bag bans and fees reduce harmful bag litter on...
openicpsr.org
delimited
Updated Apr 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anna Papp; Kimberly Oremus (2024). Data and code for "Plastic bag bans and fees reduce harmful bag litter on shorelines" [Dataset]. http://doi.org/10.3886/E200661V3
Explore at:
delimitedAvailable download formats
Unique identifier
https://doi.org/10.3886/E200661V3
Dataset updated
Apr 14, 2024
Dataset provided by
University of Delaware
Columbia University
Authors
Anna Papp; Kimberly Oremus
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Code and data for "Plastic bag bans and fees reduce harmful bag litter on shorelines " by Anna Papp and Kimberly Oremus.Please see included README file for details: This folder includes code and data to fully replicate Figures 1-5. In addition, the folder also includes instructions to rerun data cleaning steps. Last modified: March 6, 2025For any questions, please reach out to ap3907@columbia.edu._Code (replication/code):To replicate main figures, run each file for each main figure: - 1_figure1.R- 1_figure2.R- 1_figure3.R - 1_figure4.R- 1_figure5.R Update the home directory to match where the directory is saved ("replication" folder) in this file before running it. The code will require you to install packages (see note on versions below).To replicate entire data cleaning pipeline:- First download all required data (explained in Data section below). - Run code in code/0_setup folder (refer to separate README file)._ R-Version and Package VersionsThe project was developed and executed using:- R version: 4.0.0 (2024-04-24)- Platform: macOS 13.5 Code was developed and main figures were created using the following versions: - data.table: 1.14.2- dplyr: 1.1.4- readr: 2.1.2- tidyr: 1.2.0- broom: 0.7.12- stringr: 1.5.1- lubridate: 1.7.9- raster: 3.5.15- sf: 1.0.7- readxl: 1.4.0- cobalt: 4.4.1.9002- spdep: 1.2.3- ggplot2: 3.4.4- PNWColors: 0.1.0- grid: 4.0.0- gridExtra: 2.3- ggpubr: 0.4.0- knitr: 1.48- zoo: 1.8.12 - fixest: 0.11.2- lfe: 2.8.7.1 - did: 2.1.2- didimputation: 0.3.0 - DIDmultiplegt: 0.1.0- DIDmultiplegtDYN: 1.0.15- scales: 1.2.1 - usmap: 0.6.1 - tigris: 2.0.1 - dotwhisker: 0.7.4_Data Processed data files are provided to replicate main figures. To replicate from raw data, follow the instructions below.Policies (needs to be recreated or email for version): Compiled from bagtheban.com/in-your-state/, rila.org/retail-compliance-center/consumer-bag-legislation, baglaws.com, nicholasinstitute.duke.edu/plastics-policy-inventory, and wikipedia.org/wiki/Plastic_bag_bans_in_the_United_States; and massgreen.org/plastic-bag-legislation.html and cawrecycles.org/list-of-local-bag-bans to confirm legislation in Massachusetts and California.TIDES (needs to be downloaded for full replication): Download cleanup data for the United States from Ocean Conservancy (coastalcleanupdata.org/reports). Download files for 2000-2009, 2010-2014, and then each separate year from 2015 until 2023. Save files in the data/tides directory, as year.csv (and 2000-2009.csv, 2010-2014.csv) Also download entanglement data for each year (2016-2023) separately in a file called data/tides/entanglement (each file should be called 'entangled-animals-united-states_YEAR.csv').Shapefiles (needs to be downloaded for full replication): Download shapefiles for processing cleanups and policies. Download county shapefiles from the US Census Bureau; save files in the data/shapefiles directory, county shapefile should be in folder called county (files called cb_2018_us_county_500k.shp). Download TIGER Zip Code tabulation areas from the US Census Bureau (through data.gov); save files in the data/shapefiles directory, zip codes shapefile folder and files should be called tl_2019_us_zcta510.Other: Helper files with US county and state fips codes, lists of US counties and zip codes in data/other directory, provided in the directory except as follows. Download zip code list and 2020 IRS population data from United States zip codes and save as uszipcodes.csv in data/other directory. Download demographic characteristics of zip codes from Social Explorer and save as raw_zip_characteristics.csv in data/other directory.Refer to the .txt files in each data folder to ensure all necessary files are downloaded.
Long-term global vegetation and climate index datasets
zenodo.org
sh, text/x-python
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Won-Jun Choi; Won-Jun Choi; Hwan-Jin Song; Hwan-Jin Song (2025). Long-term global vegetation and climate index datasets [Dataset]. http://doi.org/10.5281/zenodo.15048700
Explore at:
sh, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15048700
Dataset updated
Mar 25, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Won-Jun Choi; Won-Jun Choi; Hwan-Jin Song; Hwan-Jin Song
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NDVI Data Set (1. NDVI.nc)

Global Vegetation Greenness (NDVI) from AVHRR GIMMS-3G+, 1981-2022

Variable: Normalized Difference Vegetation Index (NDVI)

Area: Global (60°S ~ 70°N, -180°W ~ 180°E)

Period: 1982-01-01 ~ 2022-12-31

Horizontal resolution: 0.25° × 0.25° (Regridded from original 0.0833° × 0.0833°)

Temporal resolution: Bi-monthly (1st–15th and 16th–end of each month)

Source: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2187

Meteorological Data Set (2.Temperature.nc, ... , 6.Cloud_cover.nc)

Agrometeorological indicators from 1979 to present derived from reanalysis, Copernicus Climate Change Service

Variable: Normalized Difference Vegetation Index (NDVI)

Area: Global (60°S ~ 70°N, -180°W ~ 180°E)

Period: 1982-01-01 ~ 2022-12-31

Horizontal resolution: 0.25° × 0.25° (Regridded from original 0.1° × 0.1°)

Temporal resolution: Bi-monthly (1st–15th and 16th–end of each month)

The meteorological data were converted from daily values to bi-monthly average values.

Variables: 2m temperature (K), 2m relative humidity (%), 10m wind speed (m s⁻¹), Precipitation flux (mm day⁻¹), Solar radiation flux (J m⁻² day⁻¹), Cloud cover (dimensionless)

Source: https://cds.climate.copernicus.eu/datasets/sis-agrometeorological-indicators?tab=overview

Pre-processing code (Set_data_1~3)

Set_data_1 : Combining raw data for NDVI and checking for missing values in the original data

Set_data_2 : Combining annual data, calculating semi-monthly averages, and setting the latitude and longitude ranges for meteorological data.

Set_data_3 : Synchronization of latitude, longitude, and resolution between NDVI and meteorological data.

Analysis code (code1~5)

code_1 : This script processes climate data for variables by calculating their seasonal anomalies and time-averaged values. It performs the following steps:

Monthly Mean Calculation: The script first calculates the monthly mean for each variable over a period of 41 years.

Semi-Monthly Mean Calculation: It then computes the semi-monthly mean by combining the monthly mean data.

Anomaly Calculation: The script calculates the anomaly by subtracting the semi-monthly mean from the original data.

Time Mean Calculation: Finally, the time-mean for the entire time period is calculated for each variable.

code_2 : This script calculates the linear regression slope, intercept, correlation coefficient (r-value), p-value, and standard error for a given climate variable (in this case, temperature anomaly) over time at each latitude and longitude point. The steps involved are:

Load Data: The script loads the input NetCDF file and extracts the time and temperature anomaly (or other climate data) values.

Linear Regression: For each spatial point (latitude, longitude), the script performs a linear regression between time and the corresponding climate data to determine the slope, intercept, r-value, p-value, and standard error.

Save Results: The regression results are saved into a new NetCDF file with variables for slope, intercept, r-value, p-value, and standard error for each latitude and longitude point.

code_3 : This script processes NDVI (Normalized Difference Vegetation Index) data by performing the following steps:

Prepare Heatmap Data: It reshapes the NDVI data into a 4D array of the shape (latitude, longitude, years, 24 months), where each year contains 24 months of data.

Compute NDVI Histograms: It computes histograms of the NDVI data for each latitude, longitude, and year, adjusting the NDVI values into 20 bins for analysis.

Save Histogram Data: The histogram data is then saved to a .npy file, which stores the data for further analysis.

code_4 : This script performs k-means clustering on NDVI data, based on histograms of NDVI values:

Load Data: It loads NDVI data and histogram data (NDVI values in bins) from files.

Filter Data: It filters out regions with zero values to focus on areas of interest.

Reshape Data: The data is reshaped into a 2D format to prepare for clustering.

K-Means Clustering: The script applies k-means clustering to the reshaped histogram data.

Mean NDVI Calculation: It calculates the mean NDVI value for each cluster by extracting values from the NDVI data.

Reordering Clusters: The clusters are reordered based on their mean NDVI values.

Save Results: Finally, the script saves the cluster labels and non-zero indices into separate files.

code_5 : This script processes NDVI (Normalized Difference Vegetation Index) data by clustering and saving the data for each cluster.

Load Data

Loads NDVI slope data (slope) from a NetCDF file.

Loads precomputed cluster labels (cluster_labels_8.npy) and valid data locations (non_zero_indices_8.npy).

Save NDVI Data by Cluster

Categorizes NDVI data based on clusters.

Creates a 2D array for each cluster and assigns NDVI data to the corresponding cluster coordinates.

Saves the clustered NDVI data as .npy files for further analysis.

Create Directory and Execute

Creates the output directory (if it does not exist).

Runs the main function to save the clustered NDVI data.

Acknowledgments

This work was also supported by Global - Learning & Academic research institution for Master’s·PhD students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00301914).
d
R script to create boxplots of change factors by NOAA Atlas 14 station, or...
catalog.data.gov
data.usgs.gov
Updated Jul 20, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). R script to create boxplots of change factors by NOAA Atlas 14 station, or for all stations in a Florida HUC-8 basin or county (create_boxplot.R) [Dataset]. https://catalog.data.gov/dataset/r-script-to-create-boxplots-of-change-factors-by-noaa-atlas-14-station-or-for-all-stations-355d8
Explore at:
Dataset updated
Jul 20, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
The Florida Flood Hub for Applied Research and Innovation and the U.S. Geological Survey have developed projected future change factors for precipitation depth-duration-frequency (DDF) curves at 242 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in Florida. The change factors were computed as the ratio of projected future to historical extreme-precipitation depths fitted to extreme-precipitation data from downscaled climate datasets using a constrained maximum likelihood (CML) approach as described in https://doi.org/10.3133/sir20225093. The change factors correspond to the periods 2020-59 (centered in the year 2040) and 2050-89 (centered in the year 2070) as compared to the 1966-2005 historical period. An R script (create_boxplot.R) is provided which generates boxplots of change factors for a NOAA Atlas 14 station, or for all NOAA Atlas 14 stations in a Florida HUC-8 basin or county for durations of interest (1, 3, and 7 days, or combinations thereof) and return periods of interest (5, 10, 25, 50, 100, 200, and 500 years, or combinations thereof). The user also has the option of requesting that the script save the raw change factor data used to generate the boxplots, as well as the processed quantile and outlier data shown in the figure. The script allows the user to modify the percentiles used in generating the boxplots. A Microsoft Word file documenting code usage and available options is also provided within this data release (Documentation_R_script_create_boxplot.docx). As described in the documentation, the R script relies on some of the Microsoft Excel spreadsheets published as part of this data release. The script uses basins defined in the "Florida Hydrologic Unit Code (HUC) Basins (areas)" from the Florida Department of Environmental Protection (FDEP; https://geodata.dep.state.fl.us/datasets/FDEP::florida-hydrologic-unit-code-huc-basins-areas/explore) and their names are listed in the file basins_list.txt provided with the script. County names are listed in the file counties_list.txt provided with the script. NOAA Atlas 14 stations located in each Florida HUC-8 basin or county are defined in the Microsoft Excel spreadsheet Datasets_station_information.xlsx which is part of this data release. Instructions are provided in code documentation (see highlighted text on page 7 of Documentation_R_script_create_boxplot.docx) so that users can modify the script to generate boxplots for basins different from the FDEP "lorida Hydrologic Unit Code (HUC) Basins (areas).
o
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
openicpsr.org
Updated May 18, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2022 [Dataset]. http://doi.org/10.3886/E103500V10
Explore at:
Unique identifier
https://doi.org/10.3886/E103500V10
Dataset updated
May 18, 2018
Dataset provided by
Princeton University
Authors
Jacob Kaplan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
1991 - 2021
Area covered
United States
Description
!!!WARNING~~~This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporting, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html) as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.comVersion 10 release notes:Adds 2022 dataVersion 9 release notes:Adds 2021 data.Version 8 release notes:Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will be the last UCR hate crime data they release. Changes .rda file to .rds.Version 7 release notes:Changes release notes description, does not change data.Version 6 release notes:Adds 2018 dataVersion 5 release notes:Adds data in the following formats: SPSS, SAS, and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Adds data for 1991.Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013 causing there to be two columns and zero values for years with the wrong label.All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. Version 4 release notes: Adds data for 2017.Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent. Made the 'population' column which is the total population in that agency. Version 3 release notes: Adds data for 2016.Order rows by year (descending) and ORI.Version 2 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code. The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim religious victim, etc.). The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), made all character values lower case, reordered columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
Data from: Shifting limitation of primary production: experimental support...
caryinstitute.figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaija Gahm; Carly Olson; Chris Solomon; Stuart Jones (2023). Shifting limitation of primary production: experimental support for a new model in lake ecosystems [Dataset]. http://doi.org/10.25390/caryinstitute.12821054.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.25390/caryinstitute.12821054.v1
Dataset updated
May 31, 2023
Dataset provided by
Cary Institute of Ecosystem Studies
Authors
Kaija Gahm; Carly Olson; Chris Solomon; Stuart Jones
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The limits on primary production vary in complex ways across space and time. Strong tests of clear conceptual models have been instrumental in understanding these patterns in both terrestrial and aquatic ecosystems. Here we present the first experimental test of a new model describing how shifts from nutrient to light limitation control primary productivity in lake ecosystems as hydrologic inputs of nutrients and organic matter vary. We found support for two key predictions of the model: that gross primary production (GPP) follows a hump-shaped relationship with increasing dissolved organic carbon (DOC) concentrations; and that the maximum GPP, and the critical DOC concentration at which the hump occurs, are determined by the stoichiometry and chromophoricity of the hydrologic inputs. Our results advance fundamental understanding of the limits on aquatic primary production, and have important applications given ongoing anthropogenic alterations of the nutrient and organic matter inputs to surface waters.Data and R code have been made available to conduct statistical analyses and generate Figures 2-4. Figures 2 & 4 use the same data file titled 'compiledNutrientLightData.csv'. R code titled 'NutrientLightData_Analyses.R' and 'nutrientLimitationAnalysis.R' generate Figure 2 and Figure 4, respectively. Files for Figure 3 are titled 'gppData.csv' and 'gppAnalyses.R'. Save files and set working directory in R code to where data files are saved.
n
SDM env predictor comparison dataset
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Feb 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Catherine Buckland (2023). SDM env predictor comparison dataset [Dataset]. http://doi.org/10.5061/dryad.wwpzgmsmt
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wwpzgmsmt
Dataset updated
Feb 10, 2023
Dataset provided by
University of Oxford
Authors
Catherine Buckland
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Identifying the environmental drivers of the global distribution of succulent plants using the crassulacean acid metabolism pathway of photosynthesis has previously been investigated through ensemble-modelling of species delimiting the realised niche of the natural succulent biome. An alternative approach, which may provide further insight into the fundamental niche of succulent plants in the absence of dispersal limitation, is to model the distribution of selected species that are globally widespread and have become naturalised far beyond their native habitats. This could be of interest, for example, in defining areas that may be suitable for cultivation of alternative crops resilient to future climate change. We therefore explored the performance of climate-only species distribution models in predicting the drivers and distribution of two widespread CAM plants, Opuntia ficus-indica and Euphorbia tirucalli. Using two different algorithms and five predictor sets, we created distribution models for these examplar species and produced an updated map of global inter-annual rainfall predictability. No single predictor set produced markedly more accurate models, with the basic bioclim-only predictor set marginally out-performing combinations with additional predictors. Minimum temperature of the coldest month was the single most important variable in determining spatial distribution, but additional predictors such as precipitation and inter-annual precipitation variability were also important in explaining the differences in spatial predictions between SDMs. When compared against previous projections, an a posteriori approach correctly does not predict distributions in areas of ecophysiological tolerance yet known absence (e.g. due to biotic competition). An updated map of inter-annual rainfall predictability has successfully identified regions known to be depauperate in succulent plants. High model performance metrics suggest that the majority of potentially suitable regions for these species are predicted by these models with a limited number of climate predictors, and there is no benefit in expanding model complexity and increasing the potential for overfitting. Methods R code and additional raster datasets attached for recreating the data, models and results presented in Buckland et al. (2022) - Ecol & Evolution. Save the additional raster datasets to the working directory, and update the file pathways in the R code to the relevant directory locations. Please refer to README file for futher information, or contact the corresponding author: catherine.buckland@ouce.ox.ac.uk
P
DQN Replay Dataset Dataset
paperswithcode.com
library.toponeai.link
Updated Jul 23, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rishabh Agarwal; Dale Schuurmans; Mohammad Norouzi (2021). DQN Replay Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/dqn-replay-dataset
Explore at:
Dataset updated
Jul 23, 2021
Authors
Rishabh Agarwal; Dale Schuurmans; Mohammad Norouzi
Description
The DQN Replay Dataset was collected as follows: We first train a DQN agent, on all 60 Atari 2600 games with sticky actions enabled for 200 million frames (standard protocol) and save all of the experience tuples of (observation, action, reward, next observation) (approximately 50 million) encountered during training.

This logged DQN data can be found in the public GCP bucket gs://atari-replay-datasets which can be downloaded using gsutil. To install gsutil, follow the instructions here.

After installing gsutil, run the command to copy the entire dataset:

gsutil -m cp -R gs://atari-replay-datasets/dqn

To run the dataset only for a specific Atari 2600 game (e.g., replace GAME_NAME by Pong to download the logged DQN replay datasets for the game of Pong), run the command:

gsutil -m cp -R gs://atari-replay-datasets/dqn/[GAME_NAME]

This data can be generated by running the online agents using batch_rl/baselines/train.py for 200 million frames (standard protocol). Note that the dataset consists of approximately 50 million experience tuples due to frame skipping (i.e., repeating a selected action for k consecutive frames) of 4. The stickiness parameter is set to 0.25, i.e., there is 25% chance at every time step that the environment will execute the agent's previous action again, instead of the agent's new action.
g
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
Explore at:
Unique identifier
https://doi.org/10.3886/E105403
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 4 release notes:Adds data for 2018Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
d
Daily summaries of stream temperature at sites in Alaska, U.S. Geological...
dataone.org
search.dataone.org
+1more
Updated Jan 3, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
United States Geological Survey (2019). Daily summaries of stream temperature at sites in Alaska, U.S. Geological Survey (USGS) [Dataset]. http://doi.org/10.5063/F1NC5ZFF
Explore at:
Unique identifier
https://doi.org/10.5063/F1NC5ZFF
Dataset updated
Jan 3, 2019
Dataset provided by
Knowledge Network for Biocomplexity
Authors
United States Geological Survey
Area covered

Variables measured
Date, Link, NOTES, years, Active, Status, site_no, End_date, Latitude, max_temp, and 31 more
Description
Stream temperature is an important parameter to ecology, climate, and hydrology studies in Alaska. The United States Geological Survey (USGS) takes continuous stream temperature measurements at various sites around the state as part of the National Water Information Service (NWIS). Here the daily min, mean, and max temperature data (degrees C) at 116 stations across Alaska are included. Data are provided in site-level files, with an additional file describing site and station information. USGS generated flags indicating suspect data are also included in each site-level data file. The data was downloaded from USGS via the USGS R package ‘dataRetrieval’. Included in this data package is the R script, ‘USGS_StreamTemperature_Data_Download.R’, that was used to download this data. This package can be used to download similar data from the USGS data repository. The R script includes code that will save all data to one master file, split the files by site (as done here), as well as code that will save data by individual site number, if needed.

Facebook

Twitter

Click to copy link

Link copied

Cite

Xu, Ningning (2023). Replication Data for: realdata [Dataset]. http://doi.org/10.7910/DVN/AFZZVP

Replication Data for: realdata

Explore at:

Unique identifier

https://doi.org/10.7910/DVN/AFZZVP

Dataset updated

Nov 8, 2023

Dataset provided by

Harvard Dataverse

Authors

Xu, Ningning

Description

(1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and codes to clean the realdata sets and obtain the annotation databases, which are save as .RData files in sudfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1. (3) data_info.R: code to obtain Table 2 and Table 3. (4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The runing time of rejections_perdataset.R is 7 hours around, we thus save the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (h). (6) iterationandtime_plot.R: code for generating Figure 5 based on “Al-Mutawa” data. The code is really time-consuming, nearly 5 days, we thus save the corresponding results and plot them in the main manuscript by pgfplot.

Clear search

Close search

Google apps

Main menu

Replication Data for: realdata

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

Petre_Slide_CategoricalScatterplotFigShare.pptx

7 Display the graph in a separate window. Dot colors indicate

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Data release for solar-sensor angle analysis subset associated with the...

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

MW3-Dataset

Area Resource File (ARF)

R script to create boxplots of change factors by NOAA Atlas 14 station, or...

Replication Data for \"News from the Other Side: How Topic Relevance Limits...

Health and Retirement Study (HRS)

Data and code for "Plastic bag bans and fees reduce harmful bag litter on...

Long-term global vegetation and climate index datasets

R script to create boxplots of change factors by NOAA Atlas 14 station, or...

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

Data from: Shifting limitation of primary production: experimental support...

SDM env predictor comparison dataset

DQN Replay Dataset Dataset

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

Daily summaries of stream temperature at sites in Alaska, U.S. Geological...

Replication Data for: realdata