36 datasets found
  1. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see the Step 2 detail levels)
    - at least 16Gb of RAM (64Gb preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  2. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
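
    For convenience, a minimal, self-contained R sketch of the three-step protocol is given below. It assumes an input file named 'data.csv' with the columns Replicate, Condition and Value described in Step 1; the file name and the use of read.csv (instead of the interactive file chooser used in the original script) are assumptions.

      # Minimal sketch of the protocol; 'data.csv' is a placeholder file name.
      library(ggplot2)

      # Step 1 output: a .csv file with columns Replicate, Condition, Value
      dat <- read.csv("data.csv", header = TRUE)

      # Step 2: categorical scatterplot (jittered dots colored by replicate, boxplots superimposed)
      graph <- ggplot(dat, aes(x = Condition, y = Value))
      p <- graph +
        geom_boxplot(outlier.colour = "black", colour = "black") +
        geom_jitter(aes(col = Replicate)) +
        theme_bw()
      print(p)

      # Step 3: export the graph as a .pdf file
      ggsave("categorical_scatterplot.pdf", plot = p)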

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  3. R-code, Dataset, Analysis and output (2012-2020): Occupancy and Probability...

    • catalog.data.gov
    • datasets.ai
    Updated Feb 22, 2025
    Cite
    U.S. Fish and Wildlife Service (2025). R-code, Dataset, Analysis and output (2012-2020): Occupancy and Probability of Detection for Bachman's Sparrow (Aimophila aestivalis), Northern Bobwhite (Collinus virginianus), and Brown-headed Nuthatch (Sitta pusilla) to Habitat Management Practices on Carolina Sandhills NWR [Dataset]. https://catalog.data.gov/dataset/r-code-dataset-analysis-and-output-2012-2020-occupancy-and-probability-of-detection-for-ba
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    U.S. Fish and Wildlife Service (http://www.fws.gov/)
    Description

    This reference contains the R code for the analysis and summary of detections of Bachman's sparrow, bobwhite quail and brown-headed nuthatch through 2020. Specifically, it generates probability of detection and occupancy of the species based on call counts and calls elicited with playback. The code loads raw point count (CSV) and fire history (CSV) data and cleans/transforms them into a tidy format for occupancy analysis. It then creates the necessary data structure for occupancy analysis, performs the analysis for the three focal species, and provides functionality for generating tables and figures summarizing the key findings of the occupancy analysis. The raw data, point count locations and other spatial data (shapefiles) are contained in the dataset.
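
    The description does not name the R package used for the occupancy analysis, so the following is only an illustrative sketch of a single-season occupancy model of the kind described, using the unmarked package with simulated stand-in data; the package choice, object names and the fire-history covariate are assumptions, not the refuge's actual code.

      # Illustrative only: package, data and covariates are assumptions.
      library(unmarked)
      set.seed(1)

      # Simulated stand-in for detection/non-detection data: 50 points x 3 repeat visits
      y <- matrix(rbinom(150, 1, 0.4), nrow = 50, ncol = 3)
      site_covs <- data.frame(fire_history = rnorm(50))  # e.g. years since last burn (assumed)

      umf <- unmarkedFrameOccu(y = y, siteCovs = site_covs)
      fit <- occu(~ 1 ~ fire_history, data = umf)  # detection sub-model ~1, occupancy sub-model ~fire_history
      summary(fit)
      backTransform(fit, type = "det")  # probability of detection on the probability scale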

  4. Dataset and images for "Instantaneous R calculation for COVID-19 epidemic in...

    • zenodo.org
    bin, csv, png
    Updated Jul 22, 2024
    Cite
    Félix, Francisco H. C.; Juvenia Bezerra Fontenele (2024). Dataset and images for "Instantaneous R calculation for COVID-19 epidemic in Brazil" [Dataset]. http://doi.org/10.5281/zenodo.3819284
    Available download formats: png, csv, bin
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Félix, Francisco H. C.; Juvenia Bezerra Fontenele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was generated from raw data obtained at

    Data were processed with the R package EpiEstim (methodology in the associated preprint). Briefly, instantaneous R was estimated within a 5-day time window. Prior mean and standard deviation values for R were set at 3 and 1. The serial interval was estimated using a parametric distribution with uncertainty (offset gamma). We compared the results at two time points (day 7 and day 21 after the first case was registered in each region) from different Brazilian states in order to make inferences about the epidemic dynamics.
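
    As an orientation only, this kind of estimation can be reproduced with EpiEstim roughly as sketched below; the incidence series and serial-interval values are placeholders, and the serial-interval uncertainty step (offset gamma with uncertainty) is simplified here to a fixed parametric distribution.

      # Sketch with the EpiEstim package; incidence and serial-interval values are placeholders.
      library(EpiEstim)

      incid <- c(1, 2, 4, 7, 11, 18, 25, 34, 50, 61, 80, 95, 120, 140, 170)  # simulated daily cases

      # Sliding 5-day windows
      t_start <- seq(2, length(incid) - 4)
      t_end <- t_start + 4

      res <- estimate_R(
        incid,
        method = "parametric_si",
        config = make_config(list(
          mean_si = 4.7, std_si = 2.9,    # placeholder serial-interval parameters
          mean_prior = 3, std_prior = 1,  # prior mean and SD for R, as in the description
          t_start = t_start, t_end = t_end
        ))
      )

      head(res$R)   # instantaneous R for each 5-day window
      plot(res)     # incidence, R and serial-interval plots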

  5. Data set for reproducing plots showing stable water isotopologue transport...

    • darus.uni-stuttgart.de
    Updated Oct 6, 2022
    Cite
    Stefanie Kiemle; Katharina Heck (2022). Data set for reproducing plots showing stable water isotopologue transport and fractionation [Dataset]. http://doi.org/10.18419/DARUS-3108
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    DaRUS
    Authors
    Stefanie Kiemle; Katharina Heck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    This data set includes the *.csv data and the scripts used to reproduce the plots of the three different scenarios presented in S. Kiemle, K. Heck, E. Coltman, R. Helmig (2022) Stable water isotopologue fractionation during soil-water evaporation: Analysis using a coupled soil-atmosphere model. (Under review) Water Resources Research.

    *.csv files: The isotope distribution has been analyzed in the vertical and in the horizontal direction of a soil column for all scenarios. Therefore, we provide *.csv files generated using the ParaView tools "plot over line" or "plot over time". Each *.csv file contains information about the saturation, temperature, and component composition for each phase in mole fraction or in the isotopic-specific delta notation. Additionally, information about the evaporation rate is given in a separate *.txt file.

    Python scripts: For each scenario, we provide scripts to reproduce the presented plots.

    Scenarios: We used different free-flow conditions to analyze the fractionation processes inside the porous medium: Scenario 1 - laminar flow; Scenario 2 - laminar flow, but with isolation of the parameters affecting the fractionation process; Scenario 3 - turbulent flow. A detailed description of the data labeling and the scripts needed to reproduce each plot is given below.

    Scenario 1 - The spatial distribution of stable water isotopologues in the horizontal (-0.01 m depth) and vertical (at 0.05 m width) direction inside a soil column at selected days (DoE, Day of Experiment): Use the python scripts plot_concentration_horizontal_all.py (horizontal direction) and plot_concentration_spatial_all.py (vertical direction) to create the specific plots. The corresponding *.csv files can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. The *.csv files are named after the selected day (e.g. DoE_80 refers to day 80 of the virtual experiment).

    Scenario 1 - The influence of the evaporation rate on isotopic fractionation processes at various depths (-0.001, -0.005, -0.009, and -0.018 m) during the whole virtual experiment time: Use the python script plot_evap_isotopes_v2.py to create the plots. The data for the isotopologue distribution and the saturation can be found in the folder PlotOverTime. All data are named PlotOverTime_xxxxm, with xxxx representing the respective depth (e.g. PlotOverTime_0.001m refers to -0.001 m depth). The data for the evaporation rate can be found in the folder EvaporationRate. Note that the evaporation rate data are available as .txt because we extract the information about evaporation directly during the simulation and do not derive it through any post-processing.

    Scenario 2 - Process behavior of isolated parameters that influence the isotopic fractionation: Use plot_concentration.py to reproduce the plots, represented either in the isotopic-specific delta notation or in mole fraction. The corresponding data can be found in the folder IsotopeProfile_Vertical. The data labeling refers to the single cases (1 - no fractionation; 2 - only equilibrium fractionation; 3 - only kinetic fractionation; 4 - only liquid diffusion; 5 - Reference).

    Scenario 3 - Evaporation rate during the virtual experiment for different flow cases: With plot_evap.py and the .txt files in the folder EvaporationRate, the evaporation progression can be plotted. The labeling of the .txt files refers to the different flow cases (1 - 0.1 m/s (laminar); 2 - 0.13 m/s (laminar); 3 - 0.5 m/s (turbulent); 4 - 1 m/s (turbulent); 5 - 3 m/s (turbulent)).

    Scenario 3 - The isotope profiles in the vertical and horizontal direction of the soil column (similar to Scenario 1) for selected days: With plot_concentration_horizontal_all.py and plot_concentration_spatial_all.py, the plots for the horizontal and vertical distribution of isotopologues can be generated. The corresponding data can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. These folders are structured with subfolders containing the data of selected days of the virtual experiments (DoE, Day of Experiment), in this case days 2, 10, and 35. The data labeling remains similar to Scenario 3a).

  6. Example of how to manually extract incubation bouts from interactive plots...

    • figshare.com
    txt
    Updated Jan 22, 2016
    Cite
    Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
    Available download formats: txt
    Dataset updated
    Jan 22, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Bulla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # General information
    The script runs with R (Version 3.1.1; 2014-07-10) and the packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com). Data collection and how the individual variables were derived are described in: Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): 20131016; and Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015. Data are available as an RData file. Missing values are NA. For better readability the subsections of the script can be collapsed.

    # Description of the method
    1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
    2 - A red rectangle indicates the active field; clicking with the mouse on the depicted light signal within that field generates a data point that is automatically (via a custom-made function) saved in the csv file. For this data extraction I recommend always clicking on the bottom line of the red rectangle, as data are always available there thanks to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if a greenish vertical bar appears and a new line of data appears in the R console.
    3 - To extract incubation bouts, the first click in the new plot has to be the start of incubation, the next click marks the end of incubation, and a click on the same spot marks the start of incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot will always be assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
    4 - Once all information from one day (panel) is extracted, right-click on the plot and choose "stop". This will activate the following day (panel) for extraction.
    5 - If you wish to end extraction before going through all the rectangles, just press "escape".

    # Annotations of data files from turnstone_2009_Barrow_nest-t401_transmitter.RData
    dfr - raw data on signal strength from the radio tags attached to the rumps of the female and male, plus information about when the birds were captured and the incubation stage of the nest:
    1. who: identifies whether the recording refers to female, male, capture or start of hatching
    2. datetime_: date and time of each recording
    3. logger: unique identity of the radio tag
    4. signal_: signal strength of the radio tag
    5. sex: sex of the bird (f = female, m = male)
    6. nest: unique identity of the nest
    7. day: datetime_ variable truncated to year-month-day format
    8. time: time of day in hours
    9. datetime_utc: date and time of each recording, but in UTC time
    10. cols: colors assigned to "who"

    m - metadata for a given nest:
    1. sp: identifies species (RUTU = Ruddy Turnstone)
    2. nest: unique identity of the nest
    3. year_: year of observation
    4. IDfemale: unique identity of the female
    5. IDmale: unique identity of the male
    6. lat: latitude coordinate of the nest
    7. lon: longitude coordinate of the nest
    8. hatch_start: date and time when hatching of the eggs started
    9. scinam: scientific name of the species
    10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
    11. logger: type of device used to record incubation (IT - radio tag)
    12. sampling: mean incubation sampling interval in seconds

    s - metadata for the incubating parents:
    1. year_: year of capture
    2. species: identifies species (RUTU = Ruddy Turnstone)
    3. author: identifies the author who measured the bird
    4. nest: unique identity of the nest
    5. caught_date_time: date and time when the bird was captured
    6. recapture: was the bird captured before? (0 - no, 1 - yes)
    7. sex: sex of the bird (f = female, m = male)
    8. bird_ID: unique identity of the bird
    9. logger: unique identity of the radio tag
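
    A minimal sketch of loading the published RData file and inspecting the raw signal data described above is shown below; the lattice call is only an assumption for a quick static look, while the interactive actogram and bout extraction are done by the script's own custom functions.

      # Sketch only: load the RData file and plot raw tag signal per day (static, non-interactive).
      library(lattice)

      load("turnstone_2009_Barrow_nest-t401_transmitter.RData")  # provides dfr, m and s
      str(dfr)  # raw records: who, datetime_, logger, signal_, sex, nest, day, time, datetime_utc, cols

      # One panel per day, signal strength over time of day, grouped by 'who'
      xyplot(signal_ ~ time | factor(day), data = dfr, groups = who,
             type = "p", auto.key = TRUE,
             xlab = "Time of day (h)", ylab = "Tag signal strength")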

  7. Data to Assess Nitrogen Export from Forested Watersheds in and near the Long...

    • catalog.data.gov
    • data.usgs.gov
    Updated Mar 11, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data to Assess Nitrogen Export from Forested Watersheds in and near the Long Island Sound Basin with Weighted Regressions on Time, Discharge, and Season (WRTDS) [Dataset]. https://catalog.data.gov/dataset/data-to-assess-nitrogen-export-from-forested-watersheds-in-and-near-the-long-island-sound-
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Long Island Sound, Long Island
    Description

    The U.S. Geological Survey, in cooperation with the U.S. Environmental Protection Agency's Long Island Sound Study (https://longislandsoundstudy.net), characterized nitrogen export from forested watersheds and whether nitrogen loading has been increasing or decreasing to help inform Long Island Sound management strategies. The Weighted Regressions on Time, Discharge, and Season (WRTDS; Hirsch and others, 2010) method was used to estimate annual concentrations and fluxes of nitrogen species using long-term records (14 to 37 years in length) of stream total nitrogen, dissolved organic nitrogen, nitrate, and ammonium concentrations and daily discharge data from 17 watersheds located in the Long Island Sound basin or in nearby areas of Massachusetts, New Hampshire, or New York. This data release contains the input water-quality and discharge data, annual outputs (including concentrations, fluxes, yields, and confidence intervals about these estimates), statistical tests for trends between the periods of water years 1999-2000 and 2016-2018, and model diagnostic statistics. These datasets are organized into one zip file (WRTDSeLists.zip) and six comma-separated values (csv) data files (StationInformation.csv, AnnualResults.csv, TrendResults.csv, ModelStatistics.csv, InputWaterQuality.csv, and InputStreamflow.csv). The csv file (StationInformation.csv) contains information about the stations and input datasets. Finally, a short R script (SampleScript.R) is included to facilitate viewing the input and output data and to re-run the model. Reference: Hirsch, R.M., Moyer, D.L., and Archfield, S.A., 2010, Weighted Regressions on Time, Discharge, and Season (WRTDS), with an application to Chesapeake Bay River inputs: Journal of the American Water Resources Association, v. 46, no. 5, p. 857–880.
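
    For a first look at the annual outputs without re-running WRTDS, the released csv files named above can be read directly; the commented plot call uses hypothetical column names, so check the structure output (and SampleScript.R) for the real ones.

      # Quick look at the released csv outputs; see SampleScript.R for the authoritative workflow.
      stations <- read.csv("StationInformation.csv")  # station and input-dataset information
      annual <- read.csv("AnnualResults.csv")         # annual concentrations, fluxes, yields, confidence intervals
      trends <- read.csv("TrendResults.csv")          # trend tests, WY 1999-2000 vs 2016-2018

      str(stations); str(annual); str(trends)

      # Example plot (column names are hypothetical; adapt to the str() output above):
      # plot(Flux ~ Year, data = subset(annual, Station == stations$Station[1]), type = "b")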

  8. Input data, model output, and R scripts for a machine learning streamflow...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17 [Dataset]. https://catalog.data.gov/dataset/input-data-model-output-and-r-scripts-for-a-machine-learning-streamflow-model-on-the-wyomi
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Wyoming, Wyoming Range
    Description

    A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files and about the MLFLOW model is in the readme included in the zipped model archive folder.

  9. Game by Game MLB Batter Data (2017-2020)

    • kaggle.com
    Updated Aug 5, 2022
    Cite
    John Adamek (2022). Game by Game MLB Batter Data (2017-2020) [Dataset]. https://www.kaggle.com/datasets/johnadamek/game-by-game-mlb-batter-data-20172020
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Aug 5, 2022
    Dataset provided by
    Kaggle
    Authors
    John Adamek
    Description

    Content

    This dataset utilized raw data from Advanced Sports Analytics (https://www.advancedsportsanalytics.com/).

    This is a great website that provides raw MLB game data for every game. It is quite messy and requires quite a bit of cleaning, but the data are worth it! Batting, pitching, and play-by-play data were exported into csv files for the 2017-2020 seasons. An R script is provided.

    Columns

    Key Column information:

    Batting Order = Where the player batted in the lineup for that given day
    Position = The position they played for that game
    Pit = Total number of pitches they saw over the course of the game
    Str = Total number of strikes they saw over the course of the game
    Team.R = Total runs scored by the batter's team in the game
    Team.H = Total hits by the batter's team in the game
    Opponent.R = Total runs scored by the opposing team in the game
    Opponent.H = Total hits by the opposing team in the game
    X1b.Ump = First base umpire for the game
    X2b.Ump = Second base umpire for the game
    X3b.Ump = Third base umpire for the game
    HP.Ump = Home plate umpire for the game
    Date = Date of the game
    Game.Time = Game time
    H.A = Home or away
    Precipitation = Yes/no
    Sky = Whether it was sunny, cloudy, overcast, rain, drizzle, night, or in a dome
    Stadium = Stadium played in
    Temperature = Temperature at game time
    Weather = Character combining temperature, wind speed, wind direction, and stadium/sky
    Wind.Direction = Direction of the wind
    Wind.Speed = Wind speed in mph
    Starting.Pitcher = Starting pitcher
    Over.Under = Over/under of the game
    Moneyline = The moneyline for the batter's team
    Wagers = Amount of wagers placed on the game

    UPDATE

    Unfortunately, it seems they no longer have this raw data available on their website, so I will be uploading the raw data along with the cleaned files so that others can manipulate the data any way they like!

  10. Food and Agriculture Biomass Input–Output (FABIO) database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2022
    Cite
    Bruckner, Martin (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Bruckner, Martin
    Kuschnig, Nikolas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

    The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

    The database consists of the following main components, in compressed .rds format:

    Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

    Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

    X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

    L: the Leontief inverse, computed as (I - A)^(-1), where A is the matrix of input coefficients derived from Z and x. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

    E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3), and 4) green water use (in m3).

    mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

    mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

    A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

    Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

    Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R
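
    As a schematic of how the components fit together, a footprint for a single final-demand column can be computed roughly as below; the .rds file names follow the component names described above, the year is selected by position rather than name, and the chosen extension column is an arbitrary placeholder (the maintained version is the fabio_footprints.R script linked above).

      # Schematic footprint calculation from the FABIO components; treat as a sketch, not the official script.
      library(Matrix)

      X <- readRDS("X.rds")       # total output: matrix (24000 commodities x years)
      Y <- readRDS("Y.rds")       # final demand: list of sparse matrices, one per year
      L <- readRDS("L_mass.rds")  # Leontief inverse (mass allocation): list, one per year
      E <- readRDS("E.rds")       # environmental extensions: list, one per year

      yr <- length(Y)                 # pick the last available year (by position, not name)
      x <- as.numeric(X[, yr])
      e <- as.numeric(E[[yr]][, 2])   # one extension column; column 2 is an arbitrary placeholder

      intensity <- ifelse(x > 0, e / x, 0)  # direct intensity per unit of total output
      footprint <- as.numeric(intensity %*% L[[yr]] %*% Y[[yr]][, 1])  # footprint of one final-demand column
      footprint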

    How to cite:

    To cite FABIO work please refer to this paper:

    Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

    License:

    This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

    Known issues:

    The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, however, required some adaptations. These are documented in the code and are also reflected in the balancing item of the final demand matrices. For proper use of the database, I recommend distributing the balancing item over all other uses proportionally and running analyses with and without balancing to illustrate uncertainties.

  11. 2007-08 V3 CEAMARC-CASO Bathymetry Plots Over Time During Events

    • cmr.earthdata.nasa.gov
    • researchdata.edu.au
    Updated Sep 5, 2017
    + more versions
    Cite
    (2017). 2007-08 V3 CEAMARC-CASO Bathymetry Plots Over Time During Events [Dataset]. http://doi.org/10.4225/15/59ae2f5b239c2
    Dataset updated
    Sep 5, 2017
    Time period covered
    Dec 17, 2007 - Jan 26, 2008
    Description

    A routine was developed in R ('bathy_plots.R') to plot bathymetry data over time during individual CEAMARC events. This is so we can analyse benthic data in relation to habitat, i.e. did we trawl over a slope or was the sea floor relatively flat. Note that the depth range in the plots is autoscaled to the data, so a small range in depths appears as a scattering of points. As long as you look at the depth scale, though, interpretation will be fine.

    The R files need a file of bathymetry data, '200708V3_one_minute.csv', which contains a data export from the underway PostgreSQL ship database, and 'events.csv', which is a stripped-down version of the events export from the shipboard events database. If you wish to run the code again you may need to change the pathnames in the R script to the relevant locations. If you have opened the csv files in Excel at any stage and the R script gets an error, you may need to format the date/time columns as yyyy-mm-dd hh:mm:ss, save and close the file as csv without opening it again, and then run the R script.

    However, all output files are here for every CEAMARC event. Filenames contain a reference to CEAMARC event id. Files are in eps format and can be viewed using Ghostview which is available as a free download on the internet.
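
    If you prefer to regenerate the plots rather than use the supplied eps files, the core of the routine amounts to something like the sketch below; the column names for depth, date/time and event ids are assumptions, and the authoritative code is bathy_plots.R.

      # Hedged sketch of the plotting routine; column names are assumptions, see 'bathy_plots.R'.
      bathy <- read.csv("200708V3_one_minute.csv", stringsAsFactors = FALSE)
      events <- read.csv("events.csv", stringsAsFactors = FALSE)

      # Date/time columns must be formatted as yyyy-mm-dd hh:mm:ss (see note above)
      bathy$datetime <- as.POSIXct(bathy$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
      events$start <- as.POSIXct(events$start, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
      events$end <- as.POSIXct(events$end, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

      # One eps plot of depth over time per CEAMARC event
      for (i in seq_len(nrow(events))) {
        ev <- events[i, ]
        sub <- subset(bathy, datetime >= ev$start & datetime <= ev$end)
        if (nrow(sub) == 0) next
        postscript(sprintf("bathy_event_%s.eps", ev$event_id))
        plot(sub$datetime, sub$depth, type = "l",
             xlab = "Time", ylab = "Depth (m)",
             main = sprintf("CEAMARC event %s", ev$event_id))
        dev.off()
      }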

  12. ESG rating of general stock indices

    • narcis.nl
    • data.mendeley.com
    Updated Oct 22, 2021
    Cite
    Erhart, S (via Mendeley Data) (2021). ESG rating of general stock indices [Dataset]. http://doi.org/10.17632/58mwkj5pf8.1
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Erhart, S (via Mendeley Data)
    Description
    The files have been created by Szilárd Erhart for the research Erhart (2021): ESG ratings of general stock exchange indices, International Review of Financial Analysis. Users of the files agree to quote the above paper.

    - The Python script (PYTHONESG_ERHART.TXT) helps users get tickers by stock exchange and extract ESG scores for the underlying stocks from Yahoo Finance.
    - The R script (ESG_UA.TXT) helps to replicate the Monte Carlo experiment detailed in the study.
    - The EXPORT_ALL CSV contains the downloaded ESG data (scores, controversies, etc.) organized by stocks and exchanges.

    Disclaimer: The author takes no responsibility for the timeliness, accuracy, completeness or quality of the information provided. The author is in no event liable for damages of any kind incurred or suffered as a result of the use or non-use of the information presented or the use of defective or incomplete information. The contents are subject to confirmation and not binding. The author expressly reserves the right to alter or amend, in whole and in part, without prior notice, or to discontinue publication for a period of time or even completely.

    Read me - before using the Monte Carlo simulations script:
    (1) Copy the goascores.csv and goalscores_alt.csv files onto your own computer drive. The two files are identical.
    (2) Set the exact file location information in the 'Read in data' section of the Monte Carlo script and for the output files at the end of the script.
    (3) Load misc tools and matrixStats in your R application.
    (4) Run the code.
  13. Data and code from: Stem borer herbivory dependent on...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-stem-borer-herbivory-dependent-on-interactions-of-sugarcane-variety-ass-1e076
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843

    Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.

    Notebook files
    - 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
    - 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.

    HTML files
    These files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
    - 01_boring_analysis.html
    - 02_trait_covariate_analysis.html

    CSV data files
    These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.

    BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
    - Columns A-C are the year, date, and location. All location values are the same.
    - Column D identifies which experiment the data point was collected from.
    - Column E, Stubble, indicates the crop year (plant cane or first stubble).
    - Column F indicates the variety.
    - Column G indicates the plot (integer ID).
    - Column H indicates the stalk within each plot (integer ID).
    - Column I, # Internodes, indicates how many internodes were on the stalk.
    - Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
    - Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
    - Column AO contains notes.

    variety_lookup.csv: summary information for the 16 varieties analyzed in this study
    - Column A is the variety name.
    - Column B is the total number of stalks assessed for SCB damage for that variety across all years.
    - Column C is the number of years that variety is present in the data.
    - Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
    - Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
    - Column F is the literature reference for the SCB resistance value.

    Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
    - Column A is the variety name.
    - Column B is the SCB resistance designation as an integer.
    - Column C is the categorical SCB resistance designation (see above).
    - Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
    - Columns J-O are the same continuous traits from year 2 (first stubble).
    - Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

    ZIP file of intermediate R objects
    To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run. intermediate_R_objects.zip contains intermediate R objects generated during the model fitting and variable selection process; you may use them to reproduce the final output, including figures and tables, without having to refit the computationally intensive statistical models.
    - binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
    - binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
    - marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
    - marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
    - marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content

    Resources in this dataset:
    Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
    Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
    Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
    Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
    Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
    Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
    Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
    Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
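
    If you only want to inspect the archived models rather than refit them, a minimal sketch using the intermediate objects named above is given below (it assumes intermediate_R_objects.zip has been unzipped into the working directory).

      # Minimal sketch: inspect the archived brms fits without refitting them.
      library(brms)

      main_fit <- readRDS("binom_fit_intxns_updated_only5yrs.rds")  # main statistical model
      trait_fit <- readRDS("binom_fit_reduced.rds")                 # trait covariate analysis

      summary(main_fit)               # posterior summaries for the main model
      conditional_effects(trait_fit)  # marginal effects of the trait predictors

      load("marginal_trends.RData")   # estimated marginal trends (year, previous damage)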

  14. Integrated Hourly Meteorological Database of 20 Meteorological Stations...

    • search.dataone.org
    • osti.gov
    Updated Jan 17, 2025
    Cite
    Boris Faybishenko; Dylan O'Ryan (2025). Integrated Hourly Meteorological Database of 20 Meteorological Stations (1981-2022) for Watershed Function SFA Hydrological Modeling [Dataset]. http://doi.org/10.15485/2502101
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    ESS-DIVE
    Authors
    Boris Faybishenko; Dylan O'Ryan
    Time period covered
    Jan 1, 1981 - Dec 31, 2022
    Description

    This dataset contains (a) a script “R_met_integrated_for_modeling.R”, and (b) associated input CSV files: 3 CSV files per location to create a 5-variable integrated meteorological dataset file (air temperature, precipitation, wind speed, relative humidity, and solar radiation) for 19 meteorological stations and 1 location within Trail Creek from the modeling team within the East River Community Observatory as part of the Watershed Function Scientific Focus Area (SFA). As meteorological forcings varied across the watershed, a high-frequency database is needed to ensure consistency in the data analysis and modeling. We evaluated several data sources, including gridded meteorological products and field data from meteorological stations. We determined that our modeling efforts required multiple data sources to meet all their needs. As output, this dataset contains (c) a single CSV data file (*_1981-2022.csv) for each location (20 CSV output files total) containing hourly time series data for 1981 to 2022 and (d) five PNG files of time series and density plots for each variable per location (100 PNG files). Detailed location metadata is contained within the Integrated_Met_Database_Locations.csv file for each point location included within this dataset, obtained from Varadharajan et al., 2023 doi:10.15485/1660962. This dataset also includes (e) a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata and (f) a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type. Review the (g) ReadMe_Integrated_Met_Database.pdf file for additional details on the script, methods, and structure of the dataset. The script integrates Northwest Alliance for Computational Science and Engineering’s PRISM gridded data product, National Oceanic and Atmospheric Administration’s NCEP-NCAR Reanalysis 1 gridded data product (through the RCNEP R package, Kemp et al., doi:10.32614/CRAN.package.RNCEP), and analytical-based calculations. Further, this script downscales the input data into hourly frequency, which is necessary for the modeling efforts.
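
    A rough sketch of reading one integrated output file and reproducing simple time-series and density plots is shown below; the station file name and the column names are assumptions (the real names are documented in dd.csv and the ReadMe).

      # Sketch only: file and column names are assumptions; see dd.csv for the real data dictionary.
      met <- read.csv("ExampleStation_1981-2022.csv")       # one of the 20 per-location output files
      met$datetime <- as.POSIXct(met$datetime, tz = "UTC")

      vars <- c("air_temperature", "precipitation", "wind_speed",
                "relative_humidity", "solar_radiation")     # assumed column names for the 5 variables

      for (v in vars) {
        png(sprintf("%s_timeseries.png", v))
        plot(met$datetime, met[[v]], type = "l", xlab = "Time", ylab = v)
        dev.off()
        png(sprintf("%s_density.png", v))
        plot(density(met[[v]], na.rm = TRUE), main = v)
        dev.off()
      }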

  15. QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6 Monitoring Wells,...

    • dataone.org
    • data.ess-dive.lbl.gov
    • +1more
    Updated Feb 8, 2024
    + more versions
    Cite
    Boris Faybishenko; Roelof Versteeg; Kenneth Williams; Rosemary Carroll; Wenming Dong; Tetsu Tokunaga; Dylan O'Ryan (2024). QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6 Monitoring Wells, East River, Colorado (2016-2022) [Dataset]. http://doi.org/10.15485/1866836
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    ESS-DIVE
    Authors
    Boris Faybishenko; Roelof Versteeg; Kenneth Williams; Rosemary Carroll; Wenming Dong; Tetsu Tokunaga; Dylan O'Ryan
    Time period covered
    Nov 30, 2016 - Oct 13, 2022
    Description

    This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient located along the lower montane life zone of a hillslope near the Pumphouse location at the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater in combination with data on groundwater chemistry (see related references) can be used to estimate rates of solute export from the hillslope to the floodplain and river. QA/QC analysis of measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated values of timestamps, gap filling of missing timestamps and water levels, removal of abnormal/bad and outliers of measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and the development of regular (5-minute, 1-hour, and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques, and will be applicable for ecohydrological modeling. The package includes a Readme file, one R code file used to perform QA/QC, a series of 8 data csv files (six QA/QC-ed regular time series datasets of varying intervals (5-min, 1-hr, 1-day) and two files with QA/QC flagging of original data), and three files for the reporting format adoption of this dataset (InstallationMethods, file level metadata (flmd), and data dictionary (dd) files).QA/QC-ed data herein were derived from the original/raw data publication available at Williams et al., 2020 (DOI: 10.15485/1818367). For more information about running R code file (10.15485_1866836_QAQC_PLM1_PLM6.R) to reproduce QA/QC output files, see README (QAQC_PLM_readme.docx). This dataset replaces the previously published raw data time series, and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells are available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022). https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales. 2022/09/09 Update: Converted data files using ESS-DIVE’s Hydrological Monitoring Reporting Format. With the adoption of this reporting format, the addition of three new files (v1_20220909_flmd.csv, V1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv). 
The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description. 2023/01/18 Update: Dataset updated to include additional QA/QC-ed water level data up until 2022-10-12 for ER-PLM1 and 2022-10-13 for ER-PLM6. Reporting format specific files (v2_20230118_flmd.csv, v2_20230118_dd.csv, v2_20230118_InstallationMethods.csv) were updated to reflect the additional data. R code file (QAQC_PLM1_PLM6.R) was added to replace the previously uploaded HTML files to enable execution of the associated code. R code file (QAQC_PLM1_PLM6.R) and ReadMe file (QAQC_PLM_readme.docx) were revised to clarify where original data was retrieved from and to remove local file paths.
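
    For orientation, the QA/QC steps described above (duplicate-timestamp flagging and regularization to an hourly series) look roughly like the sketch below; the column names are assumptions and the published code in QAQC_PLM1_PLM6.R is the authoritative version.

      # Illustrative sketch of duplicate flagging and 1-hour regularization; column names are assumptions.
      wl <- read.csv("er_plm1_waterlevel_2016-2020.csv")  # header_rows above the data may need to be skipped
      wl$datetime <- as.POSIXct(wl$datetime, tz = "UTC")

      wl$dup_flag <- duplicated(wl$datetime)              # flag duplicated timestamps

      # Regular 1-hour series: mean water level per hour
      wl$hour <- as.POSIXct(format(wl$datetime, "%Y-%m-%d %H:00:00"), tz = "UTC")
      hourly <- aggregate(water_level ~ hour, data = wl, FUN = mean, na.rm = TRUE)

      # Simple gap check: hours with no observations
      all_hours <- seq(min(wl$hour), max(wl$hour), by = "1 hour")
      length(setdiff(as.character(all_hours), as.character(hourly$hour)))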

  16. Integration of Slurry Separation Technology & Refrigeration Units: Air...

    • catalog.data.gov
    Updated Jun 25, 2024
    + more versions
    Cite
    data.usaid.gov (2024). Integration of Slurry Separation Technology & Refrigeration Units: Air Quality - Particulate Matter [Dataset]. https://catalog.data.gov/dataset/integration-of-slurry-separation-technology-refrigeration-units-air-quality-particulate-ma-26bf1
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    United States Agency for International Development (http://usaid.gov/)
    Description

    This is the raw particulate matter data. Each sheet (tab) is formatted to be exported as a .csv for use with the R-code (AQ-June20.R). In order for this code to work properly, it is important that this file remain intact. Do not change the column names or codes for data, for example. And to be safe, don’t even sort. One simple change in the excel file could make the code full of bugs.

  17. Integration of Slurry Separation Technology & Refrigeration Units: Air...

    • gimi9.com
    Updated Jun 25, 2024
    + more versions
    Cite
    (2024). Integration of Slurry Separation Technology & Refrigeration Units: Air Quality - CO | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_integration-of-slurry-separation-technology-refrigeration-units-air-quality-co-b7d1e/
    Explore at:
    Dataset updated
    Jun 25, 2024
    Description

    This is the carbon monoxide data. Each sheet (tab) is formatted to be exported as a .csv file for use with the R code (AQ-June20.R). For that code to run correctly, the file must remain intact: do not change the column names or the codes used for data values, and avoid re-sorting the sheets, since even a small change to the Excel file can break the code.

  18. GENEActiv accelerometer files collected during the project entitled...

    • explore.openaire.eu
    Updated Jun 12, 2024
    + more versions
    Cite
    Guillaume Wattelez; Émilie Paufique; Pierre-Yves Le Roux; Akila Nedjar-Guerre; Solange Ponidja; Paul Zongo; Christophe Serra-Mallol; Fabrice Wacalie; Stéphane Frayon; Olivier Galy (2024). GENEActiv accelerometer files collected during the project entitled "Cultures et comportements alimentaires de la jeunesse dans les pays francophones du Pacifique au XXIème siècle: exemple de la Nouvelle-Calédonie" [Eng: "Eating cultures and behaviors of young people in French-speaking Pacific countries in the 21st century: the example of New Caledonia"] (anonymized version - third part) [Dataset]. http://doi.org/10.5281/zenodo.12682659
    Explore at:
    Dataset updated
    Jun 12, 2024
    Authors
    Guillaume Wattelez; Émilie Paufique; Pierre-Yves Le Roux; Akila Nedjar-Guerre; Solange Ponidja; Paul Zongo; Christophe Serra-Mallol; Fabrice Wacalie; Stéphane Frayon; Olivier Galy
    Area covered
    New Caledonia, French
    Description

    Description of the files

    Accelerometer .csv files
    Accelerometer .csv files are conversions from .bin files collected with the GENEActiv accelerometer device and extracted with the GENEActiv software. They are 1-second epoch files. Each accelerometer .csv file contains the following columns:
    - timestamp: start time of the 1-second epoch
    - xm: 1-second epoch average of the x acceleration
    - ym: 1-second epoch average of the y acceleration
    - zm: 1-second epoch average of the z acceleration
    - lightm: 1-second epoch average of the light
    - tempm: 1-second epoch average of the temperature
    - svmgsum: 1-second epoch sum of the SVMg
    - sdx: 1-second epoch standard deviation of the x acceleration
    - sdy: 1-second epoch standard deviation of the y acceleration
    - sdz: 1-second epoch standard deviation of the z acceleration
    - id: name of the original .bin file

    Code for extraction and conversion
    The read_a_binFile_share.R file is an example of R code that uses the GENEAread library to extract information from a .bin GENEActiv file, compute the support vector magnitude (SVMg) from the x, y and z accelerations, group the information into 1-second epoch records and save the result in a .csv file.

    Participant characteristics
    The participantCharacteristics.csv file describes the participants with de-identified information. The columns are:
    - Participant ID: participant ID for this study
    - Participant status: participant status in the study ("Adolescent" or "Adult")
    - Age range: age range of the adolescent participant ("Less than or equal to 12", "Between 13 and 14 inclusive" or "Greater than or equal to 15"); unknown for adults
    - Sex: sex of the participant ("Female" or "Male")
    - GeneActiv file: GENEActiv file (with no extension) associated with the participant
    - Starting Day: starting day of data collection, i.e. the first day the device was worn
    - Day of removal: day of device removal, i.e. the last day the device was worn
    - DOI of the non-anonymized repository: DOI of the repository where the .bin file is stored
    - DOI of the anonymized repository: DOI of the repository where the accelerometer .csv file is stored

    Consent
    Written consent was obtained from the children's parents before the start of the study. The project was first authorized by the Vice-Rectorate and then presented to the school directors. The targeted schools were contacted, and the project was then proposed to the teaching teams for their acceptance.

    Data collection
    Data were collected between July 2018 and April 2019 from 1060 school-going adolescents (10-16 years old) during class time. In each school, two classes per level (6th, 5th, 4th, 3rd) were chosen to respond to the anonymous questionnaire, which consisted of two parts lasting 30 minutes each and was carried out in two stages. About 30 participants per school (according to the number of devices available) were randomly selected to wear a GENEActiv accelerometer device for 5 to 7 consecutive days or more. Consenting participants wore the device for 7 days. When a participant refused to wear the device, another random draw was made. A total of 211 adolescents accepted to wear an accelerometer device.

    Data readability, validity and conversion
    We were not able to get data from 5 adolescent devices because of device or record failure. In this dataset containing 231 files, 206 files are from adolescents and 25 are from adults (volunteer parents). Raw data files are readable with common software libraries, such as the R package GENEAread, as well as with the GENEActiv software.
    The current .csv data files were obtained through the following steps:
    1. Data collection with the GENEActiv accelerometer device: data are stored on the devices.
    2. Data extraction with the GENEActiv software: data are stored in .bin files.
    3. Data conversion with the R package GENEAread: see the code in read_a_binFile_share.R in the current dataset (reading the file, computing the support vector magnitude (SVMg) from the x, y and z accelerations, and grouping data into 1-second blocks: timestamp by min; x, y and z axes by mean and standard deviation; light and temperature by mean; SVMg by sum).
    4. Saving the file in .csv format.

    GENEActiv accelerometer .csv files were converted with a 1-second epoch from raw GENEActiv .bin files recorded during the project entitled "Cultures et comportements alimentaires de la jeunesse dans les pays francophones du Pacifique au XXIème siècle: exemple de la Nouvelle-Calédonie" [Eng: "Eating cultures and behaviors of young people in French-speaking Pacific countries in the 21st century: the example of New Caledonia"]. Devices are 60-Hz triaxial accelerometers. This dataset also contains participantCharacteristics.csv, which provides basic information about participants, and read_a_binFile_share.R, a short R script for converting and saving accelerometer data from .bin files as 1-second epoch .csv files (see the Methods section). Participant characteristics: 10 to 16 year-old students and some parents. Number of participants: 231 (206 adole...
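
    As an illustration of the conversion step described above, the sketch below shows one possible 1-second epoch aggregation with the GENEAread package. It is a hedged approximation, not the project's read_a_binFile_share.R: the data.out column names (timestamp, x, y, z, light, temperature) and the SVMg formula abs(sqrt(x^2 + y^2 + z^2) - 1) are assumptions that may differ from the published script.

        # Illustrative sketch only; see read_a_binFile_share.R in this dataset for the authoritative code.
        # Assumes GENEAread::read.bin() returns an object whose data.out matrix has columns named
        # timestamp, x, y, z, light and temperature (names can vary between package versions).
        library(GENEAread)

        bin_to_epochs <- function(bin_path, out_csv) {
          acc <- read.bin(bin_path)
          d <- as.data.frame(acc$data.out)
          epoch <- floor(d$timestamp)                    # 1-second blocks (timestamps are in seconds)
          svmg  <- abs(sqrt(d$x^2 + d$y^2 + d$z^2) - 1)  # assumed SVMg definition
          out <- data.frame(
            timestamp = tapply(d$timestamp, epoch, min),
            xm  = tapply(d$x, epoch, mean), ym = tapply(d$y, epoch, mean), zm = tapply(d$z, epoch, mean),
            lightm  = tapply(d$light, epoch, mean),
            tempm   = tapply(d$temperature, epoch, mean),
            svmgsum = tapply(svmg, epoch, sum),
            sdx = tapply(d$x, epoch, sd), sdy = tapply(d$y, epoch, sd), sdz = tapply(d$z, epoch, sd),
            id = basename(bin_path)
          )
          write.csv(out, out_csv, row.names = FALSE)
        }

        bin_to_epochs("participant_001.bin", "participant_001_1s.csv")  # hypothetical file names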

  19. Data from: Replication code and data for: Tracking green space along...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 11, 2024
    Cite
    Falchetta, Giacomo (2024). # Replication code and data for: Tracking green space along streets of world cities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8001676
    Explore at:
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    Falchetta, Giacomo
    Hammad, T. Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication code and data for: Tracking green space along streets of world cities. By Giacomo Falchetta and Ahmed T. Hammad. Preprint: https://doi.org/10.21203/rs.3.rs-3916891/v1

    To replicate the analysis, the results, and the figures of the paper:

    Download input data from this Zenodo repository and code from GitHub: https://github.com/giacfalk/urban_green_space_mapping_and_tracking

    Optional data extraction steps (processed output data are already available in the Zenodo repository):

    Adjust your working directory

    Run [lines 4-11] of workflow/sourcer.R

    Run the JavaScript scripts written by the string_generator_training.R and string_generator_prediction.R files in Google Earth Engine (https://code.earthengine.google.com) and complete the export-to-Drive tasks to generate the output .csv files

    Run workflow/sourcer.R [lines 15-46] to train the ML model and make predictions (including figures and tables replication)
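
    The steps above call out specific line ranges of workflow/sourcer.R. One way to script that, assuming the file is written to be evaluated top to bottom, is sketched below; the repository's own instructions remain authoritative, and the working-directory path is a placeholder.

        # Helper to evaluate a line range of an R script in the global environment (illustrative).
        run_lines <- function(path, from, to) {
          eval(parse(text = readLines(path)[from:to]), envir = globalenv())
        }

        setwd("path/to/your/working/directory")   # adjust as instructed above
        run_lines("workflow/sourcer.R", 4, 11)    # optional data extraction prep
        # ... run the generated Earth Engine scripts and export the .csv files to Drive ...
        run_lines("workflow/sourcer.R", 15, 46)   # ML training, prediction, figures and tables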

  20. Emissions-based MCMC chains for Hector emissions scenario paper

    • zenodo.org
    bin, csv
    Updated Jan 24, 2020
    Cite
    Ben Vega-Westhoff; Ben Vega-Westhoff (2020). Emissions-based MCMC chains for Hector emissions scenario paper [Dataset]. http://doi.org/10.5281/zenodo.3354632
    Explore at:
    csv, bin (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ben Vega-Westhoff; Ben Vega-Westhoff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These CSV files contain MCMC chains and sampled subsets for emissions-based calibration of the Hector simple climate model (https://github.com/JGCRI/hector, DOI:10.5194/gmd-8-939-2015).

    The calibrations use a version of Hector that includes the BRICK sea-level module (https://github.com/scrim-network/BRICK, DOI:10.5194/gmd-10-2741-2017). Hector with BRICK is available on my fork of the Hector model (https://github.com/bvegawe/hector/tree/dev_slr). The calibration process is also adapted from BRICK. The code used to produce these chains can be found at https://github.com/bvegawe/hector_probabilistic, DOI:10.5281/zenodo.3236411.

    These four sets of MCMC chains were produced using hector_calib_driver.R. Inputs used to create each calibration are specified below:

    emissions_05.csv: Rscript hector_calib_driver.wideDiff.R -f *output folder* -n 1000000 --endyear 2005 --np 10

    emissions_09.csv: Rscript hector_calib_driver.wideDiff.R -f *output folder* -n 1000000 --endyear 2009 --np 10

    emissions_ohc_05.csv: Rscript hector_calib_driver.wideDiff.R -f *output folder* -n 1000000 --endyear 2005 --np 10 --obs_set noTE_obs --model_set noTE_model

    emissions_ohc_09.csv: Rscript hector_calib_driver.wideDiff.R -f *output folder* -n 1000000 --endyear 2009 --np 10 --obs_set noTE_obs --model_set noTE_model
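
    For readers who want to work with the chains, a minimal loading sketch is shown below. The column layout (one row per MCMC iteration, one column per model parameter) and the burn-in and subsample choices are assumptions for illustration, not part of this dataset's documentation.

        # Illustrative only: load one chain and draw a thinned posterior subsample.
        chain <- read.csv("emissions_05.csv")
        burn  <- floor(nrow(chain) / 2)                        # assumed burn-in fraction
        post  <- chain[(burn + 1):nrow(chain), , drop = FALSE]
        idx   <- sample(nrow(post), size = min(1000, nrow(post)))
        param_sample <- post[idx, , drop = FALSE]              # e.g. for ensemble Hector runs
        summary(param_sample)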

Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Explore at:
bin, application/gzip, zip, text/x-python (available download formats)
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):
- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`.
- disable all APIs except GitHub (Bitbucket and GitLab support were
 not yet implemented when this study was in progress): edit
 `scraper/__init__.py` and comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15-30 minutes.

- create a folder `