70 datasets found

f
Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s004
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
f
Collection of example datasets used for the book - R Programming -...
figshare.com
txt
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24728073.v1
Dataset updated
Dec 4, 2023
Dataset provided by
figshare
Authors
Kingsley Okoye; Samira Hosseini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

d
Data from: spectre: An R package to estimate spatially-explicit community...
datadryad.org
data.niaid.nih.gov
+1more
zip
Updated Oct 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Craig Eric Simpkins; Sebastian Hanß; Matthias Spangenberg; Jan Salecker; Maximilian Hesselbarth; Kerstin Wiegand (2022). spectre: An R package to estimate spatially-explicit community composition using sparse data [Dataset]. http://doi.org/10.5061/dryad.fbg79cnz7
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.fbg79cnz7
Dataset updated
Oct 6, 2022
Dataset provided by
Dryad
Authors
Craig Eric Simpkins; Sebastian Hanß; Matthias Spangenberg; Jan Salecker; Maximilian Hesselbarth; Kerstin Wiegand
Time period covered
Sep 25, 2022
Description
The simulated community datasets were built using the virtualspecies V1.5.1 R package (Leroy et al., 2016), which generates spatially-explicit presence/absence matrices from habitat suitability maps. We simulated these suitability maps using Gaussian fields neutral landscapes produced using the NLMR V1.0 R package (Sciaini et al., 2018). To allow for some level of overlap between species suitability maps, we divided the γ-diversity (i.e., the total number of simulated species) by an adjustable correlation value to create several species groups that share suitability maps. Using a full factorial design, we developed 81 presence/absence maps varying across four axes (see Supplemental Table 1 and Supplemental Figure 1): 1) landscape size, representing the number of sites in the simulated landscape; 2) γ-diversity; 3) the level of correlation among species suitability maps, with greater correlations resulting in fewer shared species groups among suitability maps; and 4) the habitat suitabil...
c
Plankton measurements found in dataset OSD taken from the MIKHAIL LOMONOSOV...
s.cnmilf.com
data.cnra.ca.gov
+3more
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact) (2025). Plankton measurements found in dataset OSD taken from the MIKHAIL LOMONOSOV (R/V; call sign UQIH; built 1957; IMO5234955) and EKVATOR in the North Atlantic, Coastal N Atlantic and other locations from 1958 - 1959 (NCEI Accession 0052915) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/plankton-measurements-found-in-dataset-osd-taken-from-the-mikhail-lomonosov-r-v-call-sign-uqih-
Explore at:
Dataset updated
Jul 1, 2025
Dataset provided by
(Point of Contact)
Description
Zooplankton biomass data collected from North Atlantic Ocean in 1958 - 1959 years received from NMFS.
Data from: Bike Sharing Dataset
kaggle.com
Updated Sep 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 10, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ram Vishnu R
Description
Problem Statement:

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider BoomBikes has recently suffered considerable dip in their revenue due to the Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

Which variables are significant in predicting the demand for shared bikes.

How well those variables describe the bike demands

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

Business Goal:

You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

Data Preparation:

You can observe in the dataset that some of the variables like 'weathersit' and 'season' have values as 1, 2, 3, 4 which have specific labels associated with them (as can be seen in the data dictionary). These numeric values associated with the labels may indicate that there is some order to them - which is actually not the case (Check the data dictionary and think why). So, it is advisable to convert such feature values into categorical string values before proceeding with model building. Please refer the data dictionary to get a better understanding of all the independent variables.

You might notice the column 'yr' with two values 0 and 1 indicating the years 2018 and 2019 respectively. At the first instinct, you might think it is a good idea to drop this column as it only has two values so it might not be a value-add to the model. But in reality, since these bike-sharing systems are slowly gaining popularity, the demand for these bikes is increasing every year proving that the column 'yr' might be a good variable for prediction. So think twice before dropping it.

Model Building:

In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number casual users who have made a rental. The variable 'registered' on the other hand shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.

Model Evaluation:

When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set. python from sklearn.metrics import r2_score r2_score(y_test, y_pred) - where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set. - Please perform this step as the R-squared score on the test set holds as a benchmark for your model.
Dataset of Indoor and Built Environment Publication in 2016, Laboratory...
catalog.data.gov
datasets.ai
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Dataset of Indoor and Built Environment Publication in 2016, Laboratory evaluation of polychlorinated biphenyls encapsulation methods [Dataset]. https://catalog.data.gov/dataset/dataset-of-indoor-and-built-environment-publication-in-2016-laboratory-evaluation-of-polyc
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The data presented in this data file is a product of a journal publication. The dataset contains PCB sorption concentrations on encapsulants, PCB concentrations in the air and in wipe samples, model simulation of the PCB concentration gradient in the source and encapsulant layers on exposed surfaces of encapsulants and in room air at different times, the ranking of encapsulants’ performance. This dataset is associated with the following publication: Liu , X., Z. Guo, K. Krebs , N. Roache, R. Stinson, J. Nardin, R. Pope, C. Mocka, and R. Logan. Laboratory evaluation of PCBs encapsulation method. Indoor and Built Environment. Sage Publications, THOUSAND OAKS, CA, USA, 25(6): 895-915, (2016).
d
Temperature, salinity and other measurements found in datasets XBT and CTD...
catalog.data.gov
s.cnmilf.com
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact) (2025). Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 1996-02-02) in the North Pacific, Coastal N Pacific and other locations in 1999 (NCEI Accession 0000857) [Dataset]. https://catalog.data.gov/dataset/temperature-salinity-and-other-measurements-found-in-datasets-xbt-and-ctd-taken-from-the-mirai-
Explore at:
Dataset updated
Jul 1, 2025
Dataset provided by
(Point of Contact)
Description
Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996) in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857). Data were submitted by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
[Superseded] Intellectual Property Government Open Data 2019
researchdata.edu.au
data.gov.au
Updated Jun 6, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
Explore at:
Dataset updated
Jun 6, 2019
Dataset provided by
Data.govhttps://data.gov/
Authors
IP Australia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
What is IPGOD?\r

The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.\r \r \r

How do I use IPGOD?\r

IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scalar.\r \r \r

IP Data Platform\r

IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform\r \r

References\r

\r The following pages can help you gain the understanding of the intellectual property administration and processes in Australia to help your analysis on the dataset.\r \r * Patents\r * Trade Marks\r * Designs\r * Plant Breeder’s Rights\r \r \r

Updates\r

\r

Tables and columns\r

\r Due to the changes in our systems, some tables have been affected.\r \r * We have added IPGOD 225 and IPGOD 325 to the dataset!\r * The IPGOD 206 table is not available this year.\r * Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.\r \r

Data quality improvements\r

\r Data quality has been improved across all tables.\r \r * Null values are simply empty rather than '31/12/9999'.\r * All date columns are now in ISO format 'yyyy-mm-dd'.\r * All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.\r * All tables are encoded in UTF-8.\r * All tables use the backslash \ as the escape character.\r * The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
Data from: AgrImOnIA: Open Access dataset correlating livestock and air...
zenodo.org
explore.openaire.eu
+1more
bin, csv
Updated Feb 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alessandro Fassò; Alessandro Fassò; Jacopo Rodeschini; Jacopo Rodeschini; Alessandro Fusta Moro; Alessandro Fusta Moro; Qendrim Shaboviq; Qendrim Shaboviq; Marco Vinciguerra; Paolo Maranzano; Paolo Maranzano; Michela Cameletti; Michela Cameletti; Francesco Finazzi; Francesco Finazzi; Natalia Golini; Natalia Golini; Rosaria Ignaccolo; Rosaria Ignaccolo; Philipp Otto; Philipp Otto; Marco Vinciguerra (2024). AgrImOnIA: Open Access dataset correlating livestock and air quality in the Lombardy region, Italy [Dataset]. http://doi.org/10.5281/zenodo.7956006
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7956006
Dataset updated
Feb 6, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alessandro Fassò; Alessandro Fassò; Jacopo Rodeschini; Jacopo Rodeschini; Alessandro Fusta Moro; Alessandro Fusta Moro; Qendrim Shaboviq; Qendrim Shaboviq; Marco Vinciguerra; Paolo Maranzano; Paolo Maranzano; Michela Cameletti; Michela Cameletti; Francesco Finazzi; Francesco Finazzi; Natalia Golini; Natalia Golini; Rosaria Ignaccolo; Rosaria Ignaccolo; Philipp Otto; Philipp Otto; Marco Vinciguerra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Lombardy, Italy
Description
The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.

The building process of the dataset is detailed in the companion paper:

A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.

available here.

This dataset is a collection of estimated daily values for a range of measurements of different dimensions as: air quality, meteorology, emissions, livestock animals and land use. Data are related to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer on Lombardy borders.

The data uses several aggregation and interpolation methods to estimate the measurement for all days.

The files in the record, renamed according to their version (es. .._v_3_0_0), are:

Agrimonia_Dataset.csv(.mat and .Rdata) which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. In order to simplify access to variables in the Agrimonia dataset, the variable name starts with the dimension of the variable, i.e., the name of the variables related to the AQ dimension start with 'AQ_'. This file is archived also in the format for MATLAB and R software.

Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.

Metadata_AQ_imputation_uncertainty.csv which contains the daily uncertainty estimate of the imputed observation for the AQ to mitigate missing data in the hourly time series.

Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.

Metadata_monitoring_network_registry.csv which contains all details about the AQ monitoring station used to build the dataset. Information about air quality monitoring stations include: station type, municipality code, environment type, altitude, pollutants sampled and other. Each row represents a single sensor.

Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.

AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.

The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data

UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0

A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where variable WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed due to some bugs.

In addition, two new columns are added, they are LI_pigs_v2 and LI_bovine_v2 and represents the density of the pigs and bovine (expressed as animals per kilometer squared) of a square of size ~ 10 x 10 km centered at the station localisation.

A new dataset is released: the Agrimonia Grid Covariates (AGC) that includes daily information for the period from 2016 to 2020 of almost all variables within the Agrimonia Dataset on a equidistant grid containing the Lombardy region and its surrounding area. The AGC does not include AQ variables as they come from the monitoring stations that are irregularly spread over the area considered.

UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2

A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where variable WE_tot_precipitation have been recomputed due to some bugs.

A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv where the spatial resolution of the variable WE_precipitation_t is corrected.

UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1

minor bug fixed

UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0

A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:

Added values for LA_land_use variable for Switzerland stations (in Agrimonia Dataset_v_2_0_0.csv)

Deleted incorrect values for LA_soil_use variable for stations outside Lombardy region during 2018 (in Agrimonia Dataset_v_2_0_0.csv)

Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)
C
Temperature, salinity and other measurements found in datasets XBT and CTD...
data.cnra.ca.gov
search.dataone.org
Updated May 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ocean Data Partners (2019). Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996) in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857) [Dataset]. https://data.cnra.ca.gov/dataset/temperature-salinity-and-other-measurements-found-in-datasets-xbt-and-ctd-taken-from-the-mirai-
Explore at:
Dataset updated
May 9, 2019
Dataset authored and provided by
Ocean Data Partners
Description
Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996)in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857). Data were submitted by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
S
Sample of Yidu-N7K data set
scidb.cn
Updated Aug 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zengtao Jiao (2021). Sample of Yidu-N7K data set [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00095
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.11922/sciencedb.j00104.00095
Dataset updated
Aug 31, 2021
Dataset provided by
Science Data Bank
Authors
Zengtao Jiao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
[instructions for use] 1. This data set is manually edited by Yidu cloud medicine according to the real medical record distribution; 2. This dataset is an example of the yidu-n7k dataset on openkg. Yidu-n7k dataset can only be used for academic research of natural language processing, not for commercial purposes. ———————————————— Yidu-n4k data set is derived from chip 2019 evaluation task 1, that is, the data set of "clinical terminology standardization task". The standardization of clinical terms is an indispensable task in medical statistics. Clinically, there are often hundreds of different ways to write about the same diagnosis, operation, medicine, examination, test and symptoms. The problem to be solved in Standardization (normalization) is to find the corresponding standard statement for various clinical statements. With the basis of terminology standardization, researchers can carry out subsequent statistical analysis of EMR. In essence, the task of clinical terminology standardization is also a kind of semantic similarity matching task. However, due to the diversity of original word expressions, a single matching model is difficult to achieve good results. Yidu cloud, a leading medical artificial intelligence technology company in the industry, is also the first Unicorn company to drive medical innovation solutions with data intelligence. With the mission of "data intelligence and green medical care" and the goal of "improving the relationship between human beings and diseases", Yidu cloud uses data artificial intelligence to help the government, hospitals and the whole industry fully tap the intelligent political and civil value of medical big data, and build a big data ecological platform for the medical industry that can cover the whole country, make overall utilization and unified access. Since its establishment in 2013, Yidu cloud has gathered world-renowned scientists and the best people in the professional field to form a strong talent team. The company has invested hundreds of millions of yuan in R & D and service system establishment every year, built a medical data intelligent platform with large data processing capacity, high data integrity and transparent development process, and has obtained more than dozens of software copyrights and national invention patents.
r
2016 SoE Built environment Estimated car passenger kms per person capital...
researchdata.edu.au
cloud.csiss.gmu.edu
+2more
Updated Jul 26, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of the Environment (2016). 2016 SoE Built environment Estimated car passenger kms per person capital cities 1945 - 2013 [Dataset]. https://researchdata.edu.au/2016-soe-built-1945-2013/2980627
Explore at:
Dataset updated
Jul 26, 2016
Dataset provided by
data.gov.au
Authors
State of the Environment
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
This data was sourced from the Bureau of Infrastructure, Transport. For more information see http://bitre.gov.au/publications/2014/is_059.aspx\r \r Figure BLT28 in Built environment. See https://soe.environment.gov.au/theme/built-environment/topic/2016/livability-transport#built-environment-figure-BLT28
d
Temperature measurements found in dataset XBT taken from the RMAS NEWTON...
catalog.data.gov
data.cnra.ca.gov
+2more
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact) (2025). Temperature measurements found in dataset XBT taken from the RMAS NEWTON (CALL SIGN GURN - launched 1975), DISCOVERY (R/V; call sign GLNE; built 12.1962; IMO5090660) and other platforms in the North Atlantic, Arctic and other locations from 1984 to 2000 (NCEI Accession 0000722) [Dataset]. https://catalog.data.gov/dataset/temperature-measurements-found-in-dataset-xbt-taken-from-the-rmas-newton-call-sign-gurn-launche1
Explore at:
Dataset updated
Jul 1, 2025
Dataset provided by
(Point of Contact)
Description
Release of ships of opportunity XBT data compiled by the Maritime Environment Information Centre at the UKHO. The data represents 5 years (1995 - 2000) of archived SOOP XBT observations. The dataset contains 7214 observations from various vessels (see vessel summary), the data ranges from years 1984 to 2000. Previously to 1995 the data was released annually to the oceanographic community through ICES and US NODC, World Data Centre A.
r
2016 SoE Built environment Upper projections of avoidable social costs of...
researchdata.edu.au
cloud.csiss.gmu.edu
+1more
Updated Jun 24, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of the Environment (2016). 2016 SoE Built environment Upper projections of avoidable social costs of congestion, 1990-2030 [Dataset]. https://researchdata.edu.au/2016-soe-built-1990-2030/2980606
Explore at:
Dataset updated
Jun 24, 2016
Dataset provided by
data.gov.au
Authors
State of the Environment
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Bureau of Infrastructure, Transport and Regional Economics data containing upper baseline projections of avoidable social costs of congestion, for the period 1990 to 2030\r \r For more information see http://bitre.gov.au/publications/2015/files/is_074.pdf\r \r Data used to produce Figure BLT6 in Built environment, SoE 2016. See https://soe.environment.gov.au/theme/built-environment/topic/2016/increased-traffic#built-environment-figure-BLT6
e
Replication Data for: Fostering Constructive Online News Discussions: The...
b2find.eudat.eu
Updated Mar 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Replication Data for: Fostering Constructive Online News Discussions: The Role of Sender Anonymity and Message Subjectivity in Shaping Perceived Polarization, Disinhibition, and Participation Intention in a Representative Sample of Online Commenters - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2600accc-f7d6-54b0-b1b1-5f9cafbd560a
Explore at:
Dataset updated
Mar 20, 2025
Description
The materials and datasets accompanying the paper “Fostering Constructive Online News Discussions: The Role of Sender Anonymity and Message Subjectivity in Shaping Perceived Polarization, Disinhibition, and Participation Intention in a Representative Sample of Online Commenters”. In this paper we report on an experiment in which we aimed to reduce perceived polarization and increase intention to join online news discussions through manipulating sender anonymity and message subjectivity (i.e., explicit acknowledgements that a statement represents the writer’s perspective, e.g., “I think that is not true”). The data files are not stored in TiU Dataverse but are accessible via the LISS Data Archive. Data filesDataset_raw – SPSS raw datafile Dataset_restructured_coding incl – SPSS restructured data file from variables to cases, coding of participants’ comments has been included as an additional variable Dataset_backstructured_for MEMORE – SPSS backstructured data file from cases to variables in order to conduct the mediation analysis in MEMORE Coding participant comments – Excell file with the coding of participants comments by the R script, including the manual checking SPSS Syntax – SPSS syntax with which the variables were constructed in the Dataset R Script – R script for all the analyses, except the mediation because that was conducted in SPSS Supplemental material Questionnaire Design lists of stimuli Stimuli lists (1-4) Dutch words and phrases for automated subjectivity coding Structure data package From the raw dataset, we made the restructured dataset which also includes the calculated variables, see the SPSS Syntax. This structured dataset was the basis for the analyses in R. The backstructured dataset is based on the restructured dataset and needed for conducting the repeated measures mediation with SPSS MEMORE. The coding dataset was also analyzed in R, and provides the input for the column “CodingComments” in the restructured dataset. Method: Survey through the LISS panel Universe: The sample consisted of 302 participants, but after removing the 8 participants that had not completed the survey, the final sample consisted of 294 participants (Mage = 54.80, SDage = 15.53, range = 17 – 88 years; 55.4% male and 44.6% female). 3.1% of the sample completed only primary education, 25.6% reported high school as their highest completed education, 31.1% had attained secondary vocational education, 25.6% finished higher professional education, and 14.7% had a University degree as their highest qualification. Notably, whereas we preselected participants on their online activity, 49.7% of the sample indicated that they do not respond to online news articles anymore, suggesting that actual participation in online discussions fluctuates over time. Of the people that do react, 54.1% also engages in discussions in online news article threads. Of those, 8.8% discusses almost never, 45% multiple times per year, 35% multiple times per month, 10% multiple times per week, and 1.3% multiple times per day. Country/Nation: The Netherlands
h
r-braincels-instruct
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Trent Kelly, r-braincels-instruct [Dataset]. https://huggingface.co/datasets/trentmkelly/r-braincels-instruct
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Trent Kelly
Description
This is a chat formatted dataset of r/braincels posts. Each row in the JSONL contains one user message, which is a submission to r/braincels, and one assistant message, which is a reply to that submission. Built from this dataset
r
2016 SoE Built Environment passenger kilometres travelled by road in capital...
researchdata.edu.au
Updated Jul 20, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of the Environment (2016). 2016 SoE Built Environment passenger kilometres travelled by road in capital cities 1991 to 2014 [Dataset]. https://researchdata.edu.au/2016-soe-built-1991-2014/2980621
Explore at:
Dataset updated
Jul 20, 2016
Dataset provided by
data.gov.au
Authors
State of the Environment
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
This data was sourced from the Bureau of Infrastructure, Transport and Regional Economics.\r For more information see https://bitre.gov.au/publications/2015/files/BITRE_yearbook_2015_full_report.pdf\r \r Dataset used to produce Figure BLT25 in Built environment. See; https://soe.environment.gov.au/theme/built-environment/topic/2016/livability-transport#built-environment-figure-BLT25\r
NBA WNBA play-by-play and shots data
kaggle.com
zip
Updated Jun 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vladislav Shufinskiy (2025). NBA WNBA play-by-play and shots data [Dataset]. https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021
Explore at:
zip(1683596108 bytes)Available download formats
Dataset updated
Jun 26, 2025
Authors
Vladislav Shufinskiy
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Description

NBA anba WNBA dataset is a large-scale play-by-play and shot-detail dataset covering both NBA and WNBA games, collected from multiple public sources (e.g., official league APIs and stats sites). It provides every in-game event—from period starts, jump balls, fouls, turnovers, rebounds, and field-goal attempts through free throws—along with detailed shot metadata (shot location, distance, result, assisting player, etc.).

Also you can download dataset from github or GoogleDrive

Tutorials

NBA play-by-play dataset R example

I will be grateful for ratings and stars on github, but the best gratitude is use of dataset for your projects.

Useful links:

nba-on-court: package for work with NBA and WNBA play-by-play data

Ryan Davis: Analyze the Play by Play Data

Python nba_api package for work with NBA API: https://github.com/swar/nba_api

R hoopR package for work with NBA API: https://hoopr.sportsdataverse.org/

Motivation

I made this dataset because I want to simplify and speed up work with play-by-play data so that researchers spend their time studying data, not collecting it. Due to the limits on requests on the NBA and WNBA website, and also because you can get play-by-play of only one game per request, collecting this data is a very long process.

Using this dataset, you can reduce the time to get information about one season from a few hours to a couple of seconds and spend more time analyzing data or building models.

I also added play-by-play information from other sources: pbpstats.com, data.nba.com, cdnnba.com. This data will enrich information about the progress of each game and hopefully add opportunities to do interesting things.

Contact Me

If you have any questions or suggestions about the dataset, you can write to me in a convenient channel for you:

LinkedIn

GIthub

X

Telegram
w
Local Plan 2004 Built-up Areas within Green Belt
data.wu.ac.at
cloud.csiss.gmu.edu
+1more
ashx, wms
Updated Jun 20, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wycombe District Council (2018). Local Plan 2004 Built-up Areas within Green Belt [Dataset]. https://data.wu.ac.at/odso/data_gov_uk/MjUwZmE2MjQtNWU4MS00YmExLWFiN2QtNDQ5M2I1NWNkNDVk
Explore at:
wms, ashxAvailable download formats
Dataset updated
Jun 20, 2018
Dataset provided by
Wycombe District Council
Area covered
c50b3c180057873215c1ca3158d954c44ce3c3bb
Description
Built-up areas within the Green Belt relevant to Policy GB4 of the Adopted Local Plan 2004, polygons

Facebook

Twitter

Click to copy link

Link copied

Cite

Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004

Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx

Explore at:

docxAvailable download formats

Unique identifier

https://doi.org/10.3389/feduc.2024.1379910.s004

Dataset updated

Mar 22, 2024

Dataset provided by

Frontiers

Authors

Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.

Clear search

Close search

Google apps

Main menu

Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...

Collection of example datasets used for the book - R Programming -...

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Data from: spectre: An R package to estimate spatially-explicit community...

Plankton measurements found in dataset OSD taken from the MIKHAIL LOMONOSOV...

Data from: Bike Sharing Dataset

Problem Statement:

Business Goal:

Data Preparation:

Model Building:

Model Evaluation:

Dataset of Indoor and Built Environment Publication in 2016, Laboratory...

Temperature, salinity and other measurements found in datasets XBT and CTD...

[Superseded] Intellectual Property Government Open Data 2019

What is IPGOD?\r

How do I use IPGOD?\r

IP Data Platform\r

References\r

Updates\r

Tables and columns\r

Data quality improvements\r

Data from: AgrImOnIA: Open Access dataset correlating livestock and air...

Temperature, salinity and other measurements found in datasets XBT and CTD...

Sample of Yidu-N7K data set

2016 SoE Built environment Estimated car passenger kms per person capital...

Temperature measurements found in dataset XBT taken from the RMAS NEWTON...

2016 SoE Built environment Upper projections of avoidable social costs of...

Replication Data for: Fostering Constructive Online News Discussions: The...

r-braincels-instruct

2016 SoE Built Environment passenger kilometres travelled by road in capital...

NBA WNBA play-by-play and shots data

Description

Motivation

Contact Me

Local Plan 2004 Built-up Areas within Green Belt

Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docxSee More Versions

Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx