https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The training data for reimplementing inDecay and FORECasT. The FASTA file records the guide RNA, strand, cut site, and target sequence, matched by OligoID. The indelgen folder contains one indelgen file per OligoID; each indelgen file records all possible indel events estimated from the target sequence. Finally, there are five processed dataframes (very large CSV files) containing all observed events and their frequencies.
Libraries Import:
- Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

Data Loading and Exploration:
- Reading the "Mall_Customers.csv" dataset into a pandas DataFrame (df).
- Displaying the first few rows of the dataset using df.head().
- Calculating descriptive statistics with df.describe().

Univariate Analysis:
- Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot.
- Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

Bivariate Analysis:
- Creating a scatter plot of 'Annual Income (k$)' vs. 'Spending Score (1-100)' using sns.scatterplot.
- Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

Gender-Based Analysis:
- Grouping the data by 'Gender' and calculating the mean of selected columns.
- Computing the correlation matrix for the grouped data and visualizing it with a heatmap.

Univariate Clustering:
- Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding an 'Income Cluster' column to the DataFrame.
- Plotting the elbow method to determine the optimal number of clusters.

Bivariate Clustering:
- Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding a 'Spending and Income Cluster' column.
- Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot.
- Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

Multivariate Clustering:
- Creating dummy variables, scaling selected columns, and applying KMeans clustering.
- Plotting the elbow method for multivariate clustering.

Result Saving:
- Saving the modified DataFrame with cluster information to "Result.csv".
- Saving the multivariate clustering plot as "Multivariate_figure.png".
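A condensed sketch of the bivariate clustering step described above, assuming the standard Mall_Customers.csv column names (elbow plot and other visualizations omitted for brevity):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Elbow method: within-cluster sum of squares (inertia) for k = 1..10
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in range(1, 11)]

# Final bivariate model with 5 clusters, as in the notebook
km = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Spending and Income Cluster"] = km.fit_predict(X)

df.to_csv("Result.csv", index=False)
```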
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original and derived data products referenced in the original manuscript are provided in the data package.
Original data:
- Table_1_source_papers.csv: Papers that met review criteria and which are summarized in Table 1 of the manuscript.

Derived data:
- change_livestock_country.csv: A dataframe containing values used to generate Figure 4a in the manuscript.
- country_avg_schist_wormy_world.csv: A dataframe containing values used to generate Figure 3 in the manuscript.
- kenya_precip_change_1951_2020.csv: A dataframe containing values used to generate Figure 4b in the manuscript.
Data were derived from the following sources:
Ogutu, J. O., Piepho, H.-P., Said, M. Y., Ojwang, G. O., Njino, L. W., Kifugo, S. C., & Wargute, P. W. (2016). Extreme wildlife declines and concurrent increase in livestock numbers in Kenya: What are the causes? PloS ONE, 11(9), e0163249. https://doi.org/10.1371/journal.pone.0163249
London Applied & Spatial Epidemiology Research Group (LASER). (2023). Global Atlas of Helminth Infections: STH and Schistosomiasis [dataset]. London School of Hygiene and Tropical Medicine. https://lshtm.maps.arcgis.com/apps/webappviewer/index.html?id=2e1bc70731114537a8504e3260b6fbc0
World Bank Group. (2023). Climate Data & Projections—Kenya. Climate Change Knowledge Portal. https://climateknowledgeportal.worldbank.org/country/kenya/climate-data-projections
Based on the dblp XML file, this dataset consists of a CSV file extracted using a Python script. The dataset can be easily loaded into a pandas (Python Data Analysis Library) DataFrame.
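For example (the file name below is a placeholder; substitute the actual CSV shipped with the dataset):

```python
import pandas as pd

# Load the extracted dblp records; "dblp.csv" is a hypothetical name
df = pd.read_csv("dblp.csv")
print(df.head())
```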
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a corpus of 56,416 unique privacy policy texts spanning the years 1996-2021.
policy-texts.zip contains a directory of text files with the policy texts. File names are the hashes of the policy text.
policy-metadata.zip contains two CSV files (can be imported into a pandas dataframe) with policy metadata including readability measures for each policy text.
labeled-policies.zip contains CSV files with content labels for each policy. Labeling was done using a BERT classifier.
Details on the methodology can be found in the accompanying paper.
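The metadata CSVs can be read straight from the archive; a minimal sketch, assuming only that policy-metadata.zip contains the two CSV files described above:

```python
import zipfile
import pandas as pd

# Read both metadata CSVs without unpacking the archive
with zipfile.ZipFile("policy-metadata.zip") as zf:
    metadata = {name: pd.read_csv(zf.open(name))
                for name in zf.namelist() if name.endswith(".csv")}
```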
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is associated with this HiPR-FISH Spatial Mapping of Cheese Rind Microbial Communities pub from Arcadia Science.
HiPR-FISH spatial imaging was used to look at the distribution of microbes within five distinct microbial communities growing on the surface of aged cheeses. Probe design and imaging were performed by Kanvas Biosciences.
This dataset includes the following:
For each field of view (roughly 135µm x 135µm; 7 FOVs per cheese specimen):
A fluorescence intensity image (*_spectral_max_projection.png/.tif).
A pseudo-colored, microbe-labeled image (*_identification.png/.tif).
A data frame containing each identified microbe's identity, position, and size (*_cell_information.csv).
A segmented mask for microbiota (*_segmentation.png/.tif).
A spatial proximity graph for each pair of species in close proximity, showing spatial enrichment over a random distribution (*_spatialheatmap.png).
A corresponding data frame used to generate the spatial proximity graph (*_absolute_spatial_association.csv) and a data frame for the average of 500 random shuffles of the taxa (*_randomized_spatial_association_matrix.csv).
For each cheese specimen:
A widefield image with FOVs located on the image (*_WF_overlay.png).
In general:
A PNG showing the color legend for each species (ARC1_taxa_color_legend.png).
A data frame showing the environmental location of each FOV in the cheese (RIND/CURD) and the location of each FOV relative to FOV 1 (ARC1_Cheese_Map.csv).
A vignette showing an example of each cell and its false coloring according to its taxonomic identification (ARC1_detected_species_representative_cell_vignette.png).
Sequences used as input in probe design (16S_18S_forKanvas.fasta).
A CSV file containing the sequences that belong to each ASV (ARC1_sequences_to_ASVs.csv).
Plots of log-transformed counts for each microbe detected across all FOVs, and broken down for each cheese (*detected_species_absolute_abundance.png).
CSVs containing pairwise correlation of FOVs based on spatial association (ARC1_spatial_association_FOV_correlation.csv) and microbial abundance (ARC1_abundance_FOV_correlation.csv).
Plots of spatial association matrices, aggregated for different cheeses and different locations (RIND vs CURD) (*samples_*loc_relative_spatial_association.png).
CSV containing the principal component coordinates for each FOV (ARC1_abundance_FOV_PCA.csv, ARC1_spatial_association_FOV_PCA.csv).
CSV containing the mean fold-change in number of edges between each ASV and the corresponding p-value when compared to the null state (random spatial association matrices) (ARC1_spatial_enrichment_significance.csv).
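As an illustration, the per-FOV cell tables can be concatenated for simple community summaries; a sketch assuming the *_cell_information.csv files sit in the working directory and that a taxon-identity column exists (the column name used here is hypothetical):

```python
import glob
import pandas as pd

# Combine cell information across all fields of view
frames = [pd.read_csv(f) for f in glob.glob("*_cell_information.csv")]
cells = pd.concat(frames, ignore_index=True)

# Cells detected per taxon across all FOVs ("taxon" is a hypothetical column name)
print(cells.groupby("taxon").size().sort_values(ascending=False))
```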
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Download the dataset
At the moment, to download the dataset you should use a pandas DataFrame:

```python
import pandas as pd

df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")
```

You can visualize the dataset with:

```python
df.head()
```

To convert it into a Hugging Face dataset:

```python
from datasets import Dataset

dataset = Dataset.from_pandas(df)
```
Dataset Description
This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
Globally rising livestock populations and declining wildlife numbers are likely to dramatically change disease risk for wildlife and livestock, especially at resources where they congregate. However, limited understanding of interspecific transmission dynamics at these hotspots hinders disease prediction or mitigation. In this study, we combined gastrointestinal nematode density and host foraging activity measurements from our prior work in this system with three estimates of parasite-sharing capacity to investigate how interspecific exposures alter the relative riskiness of an important resource – water – among cattle and five dominant herbivore species in an East African tropical savanna. We found that due to their high parasite output, water dependence, and parasite-sharing capacity, cattle greatly increased potential parasite exposures at water sources for wild ruminants. When untreated for parasites, cattle accounted for over two-thirds of total potential exposures around water fo…

# Dataset for Cattle aggregations at shared resources create potential parasite exposure hotspots for wildlife
https://doi.org/10.5061/dryad.vdncjsz28
These data accompany the publication "Cattle aggregations at shared resources create potential parasite exposure hotspots for wildlife" in Proceedings of the Royal Society B: Biological Sciences (doi: 10.1098/rspb.2023-2239).
The data include three data files and code to replicate results of the publication. Specifically, the data files are:
This resource contains "RouteLink" files for version 2.1.6 of the National Water Model which are used to associate feature identifiers for computational reaches to relevant metadata. These data are important for comparing NWM feature data to USGS streamflow and lake observations. The original RouteLink files are in NetCDF format and available here: https://www.nco.ncep.noaa.gov/pmb/codes/nwprod
This resource includes the files in a human-friendlier CSV format for easier use, and a machine-friendlier file in HDF5 format which contains a single pandas.DataFrame. The scripts and supporting utilities are also included for users that wish to rebuild these files. Source code is hosted here: https://github.com/jarq6c/NWM_RouteLinks
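Either format loads directly into pandas; a minimal sketch with hypothetical file names (the HDF5 file stores a single pandas.DataFrame, so no key is required):

```python
import pandas as pd

# CSV version: one row per NWM reach with its associated metadata
routelink_csv = pd.read_csv("routelink.csv")  # hypothetical file name

# HDF5 version: pandas infers the key when the file holds a single DataFrame
routelink_h5 = pd.read_hdf("routelink.h5")  # hypothetical file name
```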
Attribution 1.0 (CC BY 1.0) https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Speed profiles of freeways in California (I5-S and I210-E). Original data is retrieved from PeMS.
Each YEAR_FREEWAY.csv file contains Timestamp and Speed data.
The freeway_meta.csv file contains meta information for each detector: freeway number, direction, detector ID, absolute milepost, and x/y coordinates.
# Freeway speed data description
### Data loading example (single freeway: I5-S 2012)
```python
%%time
import pandas as pd
from datetime import datetime

# Date time parser for the Timestamp column
mydateparser = lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S")

# Freeway data loading (This part should be changed to a proper URL in zenodo.org)
data = pd.read_csv("dataset/2012_I5S.csv",
                   parse_dates=["Timestamp"],
                   date_parser=mydateparser).pivot(index="Timestamp", columns="Station_ID", values="Speed")

# Meta data loading
meta = pd.read_csv("dataset/freeway_meta.csv").set_index(["Fwy", "Dir"])
```
CPU times: user 50.5 s, sys: 911 ms, total: 51.4 s
Wall time: 50.9 s
### Speed data and meta data
```python
data.head()
```
```
Station_ID              1     2     3     4     5     6     7     8     9    10  ...    80    81    82    83    84    85    86    87    88    89
Timestamp                                                                        ...
2012-01-01 06:00:00  70.0  69.8  70.1  69.6  69.9  70.8  70.1  69.3  69.2  68.2  ...  72.1  67.6  71.0  66.8  65.9  58.2  67.1  63.8  67.1  71.6
2012-01-01 06:05:00  69.2  69.8  69.8  69.4  69.5  69.5  68.3  67.5  67.4  67.2  ...  71.5  66.1  69.5  67.4  68.3  59.0  66.9  60.8  66.6  65.7
2012-01-01 06:10:00  69.2  69.0  68.6  68.7  68.6  68.9  61.7  68.3  67.4  67.7  ...  71.1  65.2  71.2  66.5  65.4  59.6  66.3  58.4  68.2  65.6
2012-01-01 06:15:00  69.9  69.6  69.7  69.2  69.0  69.1  65.3  67.6  67.1  66.8  ...  69.9  67.1  69.3  66.9  68.2  60.6  66.0  55.5  67.1  69.7
2012-01-01 06:20:00  68.7  68.4  68.2  67.9  68.3  69.3  67.0  68.4  68.2  68.2  ...  70.9  67.2  69.9  65.6  66.7  62.8  66.2  62.6  67.2  67.5

[5 rows x 89 columns]
```
```python
meta.head()
```
```
         ID  Abs_mp   Latitude   Longitude
Fwy Dir
5   S     1   0.058  32.542731 -117.030501
    S     2   0.146  32.543587 -117.031769
    S     3   1.291  32.552409 -117.048120
    S     4   2.222  32.558422 -117.062360
    S     5   2.559  32.561106 -117.067228
```
### Choose a day
```python
# Sampling (2012-01-13)
myday = "2012-01-13"
# Filter the data by the day
myday_speed_data = data.loc[myday]
```
### A speed profile
```python
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
# Axis value setting
mp = meta[meta.ID.isin(data.columns)].Abs_mp
hour = myday_speed_data.index
# Draw the day
fig, ax = plt.subplots()
heatmap = ax.pcolormesh(hour,mp,myday_speed_data.T, cmap=plt.cm.RdYlGn, vmin=0, vmax=80, alpha=1)
plt.colorbar(heatmap, ax=ax)
# Appearance setting
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H"))
plt.title(pd.Timestamp(myday).strftime("%Y-%m-%d [%a]"))
plt.xlabel("hour")
plt.ylabel("milepost")
plt.show()
```

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example of a .bin file that raises an IndexError when processed.
See OxWearables/stepcount issue #120 for more details.
The .csv files are 1-second epoch conversions from the .bin file and contain time, x, y, z columns. The conversion was done by:
- reading the .bin file with the GENEAread R package,
- keeping only the time, x, y and z columns,
- saving the data.frame into a .csv file.
The only difference between the .csv files is the column format used for the time column before saving:
- the time column in XXXXXX_....csv had a string class
- the time column in XXXXXT....csv had a "POSIXct" "POSIXt" class
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For each recorded neuron there are:
- All spike onset times (NAME_spikeTimes.csv)
- All LFP-SPW onset times (NAME_lfpTimes.csv)
- All ECoG-SPW onset times (NAME_eegTimes.csv)
- A dataframe with stimulation onset times and descriptive statistics of the LFP-SPWs and ECoG-SPWs preceding stimulations (NAME_df_new.csv)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# General overview
This repository contains the data and code used in the analysis of the
manuscript entitled **"The hidden biodiversity knowledge split in biological collections"**.
# Context
Ecological and evolutionary processes generate biodiversity, yet how biodiversity data are organized and shared globally can shape our understanding of these processes. We show that name-bearing type specimens—the primary reference for species identity—of all freshwater and brackish fish species are predominantly housed in Global North museums, disconnected from their countries of origin. This geographical divide creates a ‘knowledge split’ with consequences for biodiversity science, particularly in the Global South, where researchers face barriers in studying native species’ name bearers housed abroad. Meanwhile, Global North collections remain flooded with non-native name bearers. We relate this imbalance to historical and socioeconomic factors, which ultimately restricts access to critical taxonomic reference materials and hinders global species documentation. To address this disparity, we call for international initiatives to promote fairer access to biological knowledge, including specimen repatriation, improved accessibility protocols for researchers in countries where specimens originated, and inclusive research partnerships.
# Repository structure
## data
This folder stores raw and processed data used to perform all the
analysis presented in this study
### raw
- `flow_period_region_country.csv` a data frame in long format
containing the flow of NBTs from region to region per time period
(50-year time frames). Variables:
- `period` numeric variable representing 50-year time intervals
- `region_type` character representing the name of the World Bank region
of the country where the NBT was sourced
- `country_type` character. A three letter code (alpha-3 ISO3166) representing
the country of the museum where the NBT was sourced
- `region_museum` character. Name of the World Bank region of the country
where the NBT is housed
- `country_museum` character. A three letter code (alpha-3 ISO3166) representing
the country of the museum where the NBT is housed
- `n` numeric. The number of NBT flowing from one country to another
- `spp_native_distribution.csv` data frame in the long format
containing the native composition at the country level. Variables:
- `valid_name` character. The name of a species in the format genus_epithet
according to the Catalog of Fishes
- `country_distribution` character. Three letter code (alpha-3 ISO3166)
indicating the name of the country where a species is native to
- `region_distribution` character. The name of the World Bank region
where a species is native
- `spp_type_distribution.csv` data frame in the long format containing
the composition of NBT by country. Variables:
- `valid_name` character. The name of a species in the format genus_epithet
according to the Catalog of Fishes
- `country_distribution` character. Three letter code (alpha-3 ISO3166)
indicating the name of the country where a species is housed
- `region_distribution` character. The name of the World Bank region
where a species is housed
- `bio-dem_data.csv` data frame with data downloaded from
[Bio-Dem](https://bio-dem.surge.sh/#awards) containing information
on biological and social information at the country level. Variables:
- `country` character. A three letter code (alpha-3 ISO3166) representing
a country
- `records` numeric. Total number of species occurrence records from the
Global Biodiversity Information Facility (GBIF)
- `records_per_area` numeric. Records per area from GBIF
- `yearsSinceIndependence` numeric. Years since independence for each country
- `e_migdppc` numeric. GDP per capita
- `museum_data.csv` data frame with museums' acronyms and the world
region of each. Variables:
- `code_museum` character. The acronym (three letter code) of the museum
- `country_museum` character. A three letter code (alpha-3 ISO3166) representing
a country
- `region_museum` character. The name of the region acording with
World Bank
### processed
- `flow_region.csv` a data frame containing the flow of name bearers among
world regions and the total number of name bearers derived from the source region
- `flow_period_region.csv` a data frame with the number of name bearers between
the world regions per 50-year time frame and the total number of name bearers
in each time frame for each world region
- `flow_period_region_prop.csv` a data frame with the number of name bearers,
the Domestic Contribution and Domestic Retention between the world
regions in a 50-year time frame - this is not used anymore in downstream analyses
- `flow_region_prop.csv` data with the total number of species flowing
between world regions, Domestic Contribution and Domestic Retention - this is no longer used in downstream analyses
- `flow_country.csv` data frame with flow information of name bearers among
countries
- `df_country_native.csv` data frame with the number of native species
at the country level
- `df_country_type.csv` data frame with the number of name bearers at the
country level
- `df_all_beta.csv` data frame with values of endemic deficit and non-endemic
representation at the country level
## R
The letters `D`, `A` and `V` represent scripts for, respectively, data
processing (D), data analysis (A) and results visualization (V). The
script sequence to reproduce the workflow is indicated by the numbers at
the beginning of each script file name
- [`01_D_data_preparation.qmd`](R/01_D_data_preparation.qmd) initial data preparation
- [`02_A_beta-endemics-countries.qmd`](R/02_A_beta-endemics-countries.qmd) analysis of endemic deficit and non-endemic representation. This script is used to calculate `native/endemic deficit` and `non-native/non-endemic representation`
- [`03_D_data_preparation_models.qmd`](R/03_D_data_preparation_models.qmd) script used to build data frames that will be used in statistical models ([`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd))
- [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd) statistical models for the total number of name bearers, endemic deficit and non-endemic representation
- [`05_V_chord_diagram_Fig1.qmd`](R/05_V_chord_diagram_Fig1.qmd) code used to produce circular flow diagram. This is the Figure 1 of the study
- [`06_V_world_map_Fig1.qmd`](R/06_V_world_map_Fig1.qmd) code used to produce the world map in the Figure 1 of the main text
- [`08_V_beta_endemics_Fig3.qmd`](R/08_V_beta_endemics_Fig3.qmd) code used to build Figure 2 of the main text
- [`09_V_model_Fig4.qmd`](R/09_V_model_Fig4.qmd) code used to build the Figure 3 of the main text. This is the representation of the results of the models present in the script [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd)
- [`0010_Supplementary_analysis.qmd`](R/0010_Supplementary_analysis.qmd) code to produce all the tables and figures presented in the Supplementary material of this study
## output
### Figures
In this folder you will find all figures used in the main text and supplementary material of this study
- `Fig1_flow_circle_plot.png` Figure with circular plots showing the flux of name bearers among regions of the world in a 50-year time window
- `Fig3_turnover_metrics_endemics.png` Cartogram with 3 maps showing the level of endemic deficit, non-endemic representation, and the combination of both metrics in a combined map
- `Fig4_models.png` Figure showing the predictions of the number of name bearers, endemic deficit and non-endemic representation for different predictors. This is derived from the statistical models
#### Supp-material
This folder contains the figures in the Supplementary material
- `FigS1_native_richness.png` World map with countries coloured according to native species richness, based on the Catalog of Fishes
- `FigS3_turnover_metrics.png` Cartogram with 3 maps showing the level of
native deficit, non-native representation and the combination of both metrics in a combined map
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over the course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
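For example (the file name below is a placeholder for any of the released CSVs):

```python
import pandas as pd

# Load one of the daily/hourly granularity CSVs into a DataFrame
df = pd.read_csv("daily_granularity.csv")  # placeholder file name
print(df.head())
```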
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
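Once restored, the collections can be queried from Python; a minimal sketch using pymongo with the database and collection names given above:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Count the documents restored into each collection
for name in ("fitbit", "sema", "surveys"):
    print(name, db[name].count_documents({}))
```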
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description: This data set contains the street network of New York City, retrieved using Osmnx from OpenStreetMap and converted to a GeoPandas GeoDataFrame. The street network is represented as a series of linestrings connecting nodes that represent intersections in the road network. The data set can be used for a variety of purposes, such as urban planning, transportation analysis, and spatial modeling.
Source: The data set was retrieved using Osmnx, a Python package for downloading and analyzing OpenStreetMap data, and converted to a GeoPandas GeoDataFrame using the osmnx.graph_to_gdfs() function. OpenStreetMap: https://www.openstreetmap.org/#map=4/21.82/82.79
Date: The data set was retrieved on February 24, 2023, and represents the street network of New York City as of that date.
Format: Comma-separated values (CSV) file.
Attributes: The data set includes various attributes for nodes and edges, including geographic coordinates, street names, length, and directionality
The data retrieved using Osmnx can be used for a variety of purposes, including urban planning, transportation engineering, and spatial analysis.
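A sketch of how such a data set can be rebuilt with Osmnx (the place query string and output file name are assumptions):

```python
import osmnx as ox

# Download the street network graph for New York City from OpenStreetMap
G = ox.graph_from_place("New York City, New York, USA", network_type="drive")

# Convert the graph into GeoDataFrames of nodes (intersections) and edges (streets)
nodes, edges = ox.graph_to_gdfs(G)

# Export the edge attributes (geometry, street name, length, directionality, ...) to CSV
edges.to_csv("nyc_street_network.csv")
```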
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Manuscript, data, and code associated with a germination experiment using seed enhancement technologies in New South Wales, Australia.
Two scripts provided for use in R:
1. 'treatment_comparisons.txt' details treatment-wise comparisons of emergence, survival, and average time to emergence between treatments (1) bare seed and (2) pelletised replicates of native species
2. 'trait_script.txt' details comparisons of seed morphological traits as predictors of species performance using pellets

Three major dataframes provided:
- Emergence_data.csv - raw emergence data from the experiment
- seed_traits_no_se.csv - average seed morphological trait information from x-ray images
- emergence_traits.csv - emergence speed data from species in the experiment

Three supporting dataframes provided:
- Amenability.csv - characterised amenability
- results_bin.csv - dataframe based on treatment models, used in plotting results
- pairwise_letters.csv - dataframe based on treatment models, used in plotting results
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the "Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics" preprint by Dempster et al. To generate the figure panels seen in the preprint using these data, use FigurePanelGeneration.ipynb.

This study includes five datasets (citations and details in manuscript):
- Achilles: the Broad Institute's DepMap public 19Q4 CRISPR knockout screens processed with CERES
- Score: the Sanger Wellcome Institute's Project Score CRISPR knockout screens processed with CERES
- RNAi: the DEMETER2-processed combined dataset, which includes RNAi data from the Achilles, DRIVE, and Marcotte breast screens
- PRISM: the PRISM pooled in vitro repurposing primary screen of compounds
- GDSC17: cancer drug in vitro screens performed by Sanger

The files of most interest to a biologist are the Summary.csv files. If you are interested in trying machine learning, the files Features.hdf5 and Target.hdf5 contain the data munged into a convenient form for standard supervised machine learning algorithms.

Some large files are in the binary format HDF5 for efficiency in space and read-in. These files each contain three named HDF5 datasets: "dim_0" holds the row/index names as an array of strings, "dim_1" holds the column names as an array of strings, and "data" holds the matrix contents as a 2D array of floats. In Python, these files can be read in with:

```python
import numpy as np
import pandas as pd
import h5py

def read_hdf5(filename):
    src = h5py.File(filename, 'r')
    try:
        dim_0 = [x.decode('utf8') for x in src['dim_0']]
        dim_1 = [x.decode('utf8') for x in src['dim_1']]
        data = np.array(src['data'])
        return pd.DataFrame(index=dim_0, columns=dim_1, data=data)
    finally:
        src.close()
```

Files (not every dataset will have every type of file listed below):
- AllFeaturePredictions.hdf5: Matrix of cell lines by perturbations, with values indicating the predicted viability using a model with all feature types.
- ENAdditionScore.csv: A matrix of perturbations by number of features. Values indicate an elastic net model performance (Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5) using only the top X features, where X is the column header.
- FeatureDropScore.csv: Perturbations and predictive performance for a model using all single gene expression features EXCEPT those that had greater than 0.1 feature importance in a model trained with all single gene expression features.
- Features.hdf5: A very large matrix of all cell lines by all used CCLE cell features. Continuous features were z-scored. Cell lines missing mutation or expression data were dropped. Remaining NA values were imputed to zero. Feature types are indicated by the column suffixes:
  - _Exp: expression
  - _Hot: hotspot mutation
  - _Dam: damaging mutation
  - _OtherMut: other mutation
  - _CN: copy number
  - _GSEA: ssGSEA score for an MSigDB gene set
  - _MethTSS: methylation of transcription start sites
  - _MethCpG: methylation of CpG islands
  - _Fusion: gene fusions
  - _Cell: cell tissue properties
- NormLRT.csv: the normLRT score for the given perturbation.
- RFAdditionScore.csv: similar to ENAdditionScore, but using a random forest model.
- Summary.csv: A dataframe containing predictive model results. Columns:
  - model: specifies the collection of features used (Expression, Mutation, Exp+CN, etc.)
  - gene: the perturbation (column in Target.hdf5) examined; actually a compound for the PRISM and GDSC17 datasets
  - overall_pearson: Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5
  - feature: the Nth most important feature, found by retraining the model with all cell lines (N = 0-9)
  - feature_importance: the feature importance as assessed by sklearn's RandomForestRegressor
- Target.hdf5: A matrix of cell lines by perturbations, with entries indicating post-perturbation viability scores. Note that the scales of the viability effects are different for different datasets. See manuscript methods for details.
- PerturbationInfo.csv: Additional drug annotations for the PRISM and GDSC17 datasets.
- ApproximateCFE.hdf5: A set of Cancer Functional Event cell features based on CCLE data, adapted from Iorio et al. 2016 (10.1016/j.cell.2016.06.017).
- DepMapSampleInfo.csv: Sample info from DepMap_public_19Q4 data, reproduced here as a convenience.
- GeneRelationships.csv: A list of genes and their related (partner) genes, with the type of relationship (self, protein-protein interaction, CORUM complex membership, paralog).
- OncoKB_oncogenes.csv: A list of genes that have non-expression-based alterations listed as likely oncogenic or oncogenic by OncoKB as of 9 May 2018.
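A usage sketch building on read_hdf5 above (file names from the list; the alignment logic is illustrative, not from the original):

```python
features = read_hdf5("Features.hdf5")
target = read_hdf5("Target.hdf5")

# Keep only cell lines present in both matrices before model fitting
shared = features.index.intersection(target.index)
X, Y = features.loc[shared], target.loc[shared]
```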
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is associated with the article "Occupancy and detection of agricultural threats: the case of Philaenus spumarius, European vector of Xylella fastidiosa" by the same authors, published in JOURNAL 2021. The data about Philaenus spumarius and other co-occurring species were collected in Trentino, Italy, during spring and summer 2018 in olive orchards and vineyards. Provided here are the raw data, some preprocessed data, and the R code used for the analyses presented in the publication. Please refer to the above-mentioned article for more details.
List of files:
- samplings.xlsx: original dataset of field samplings (sheet: survey), site coordinates and info (sheet: info site), and metadata (sheet: legenda)
- counts_per_site.csv: occupancy abundance dataframe for P. spumarius
- philaenus_occupancy_data.csv: occupancy presence dataframe for P. spumarius
- sites.cov.csv: site covariates for the occupancy model
- observation.cov.csv: observation covariates for the occupancy model
- Rcode.zip: commented code and data in R format to run occupancy models for P. spumarius
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- data.csv: the main data frame used for the primary analysis, in long format
- data_combined.csv: the main data frame used and formatted for the primary analysis
- data_spatial.csv: the data frame used for the spatial autocorrelation
- README file: columns of the data frames explained
https://creativecommons.org/publicdomain/zero/1.0/