Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
This dataset was created by Shail_2604
Released under Other (specified in description)
quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
species.in.quadrat.scale.data: Refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
Dwyer_&_Laughlin_2017_Trait_covariance_script: This script reads in the two dataframes of "raw" data, calculates diversity and trait metrics, and runs the major analyses presented in Dwyer & Laughlin 2017.
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load the sharded data set with dask:
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, compute the description of the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we present three datasets describing three large European landscapes in France (Bauges Geopark - 89,000 ha), Poland (Milicz forest district - 21,000 ha) and Slovenia (Snežnik forest - 4,700 ha) down to the tree level. Individual trees were generated combining inventory plot data, vegetation maps and Airborne Laser Scanning (ALS) data. Together, these landscapes (hereafter virtual landscapes) cover more than 100,000 ha including about 64,000 ha of forest and consist of more than 42 million trees of 51 different species. For each virtual landscape we provide a table (in .csv format) with the following columns:
- cellID25: the unique ID of each 25x25 m² cell
- sp: species latin names
- n: number of trees. n is an integer >= 1, meaning that a specific set of species "sp", diameter "dbh" and height "h" can be present multiple times in a cell.
- dbh: tree diameter at breast height (cm)
- h: tree height (m)
We also provide, for each virtual landscape, a raster (in .asc format) with the cell IDs (cellID25) which makes data spatialisation possible. The coordinate reference systems are EPSG: 2154 for the Bauges, EPSG: 2180 for Milicz, and EPSG: 3912 for Sneznik. The v2.0.0 presents the algorithm in its final state. Finally, we provide a proof of how our algorithm makes it possible to reach the total BA and the BA proportion of broadleaf trees provided by the ALS mapping using the alpha correction coefficient and how it maintains the Dg ratios observed on the field plots between the different species (see algorithm presented in the associated Open Research Europe article). Below is an example of R code that opens the datasets and creates a tree density map.
------------------------------------------------------------
# load packages
library(terra)
library(dplyr)

setwd() # define path to the I-MAESTRO_data folder

# read the tree table and the cell ID raster
tree <- read.csv2('./sneznik/sneznik_trees.csv', sep = ',')
cellID <- rast('./sneznik/sneznik_cellID25.asc')
cellIDdf <- as.data.frame(cellID)
colnames(cellIDdf) <- 'cellID25'

# compute the number of trees per cell and map it
dens <- tree %>% group_by(cellID25) %>% summarise(n = sum(n))
dens <- left_join(cellIDdf, dens, join_by(cellID25))
cellID$dens <- dens$n
plot(cellID$dens)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With columns corresponding to the parameter names of a naive Monod process model, the parametrization of each replicate, identified by a replicate ID (rid), is specified in a tabular format. Parameter identifiers that appear multiple times (e.g. S0) correspond to a parameter shared across replicates. Accordingly, replicate-local parameter names simply do not appear multiple times (e.g. X0_A06). Numeric entries are interpreted as fixed values and will be left out of parameter estimation. Columns do not need to be homogeneously fixed/shared/local, but parameters can only be shared within the same column. The parameter mapping can be provided as a DataFrame object.
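A minimal sketch of what such a parameter mapping table could look like, assuming a second replicate B07 and the extra columns mu_max and K_S purely for illustration (only rid, S0 and X0_A06 appear in the description above):
```python
import pandas as pd

# Hypothetical parameter mapping: strings are parameter names (shared if they
# repeat across rows, replicate-local otherwise), numeric entries are fixed
# values that are left out of parameter estimation.
mapping = pd.DataFrame(
    {
        "rid": ["A06", "B07"],            # replicate IDs
        "S0": ["S0", "S0"],               # shared across replicates
        "X0": ["X0_A06", "X0_B07"],       # replicate-local parameters
        "mu_max": ["mu_max", "mu_max"],   # shared (illustrative name)
        "K_S": [0.02, 0.02],              # fixed numeric values (illustrative)
    }
).set_index("rid")
print(mapping)
```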
https://market.oceanprotocol.com/terms
Until August 18th, 2023.
Aave User Data Description
This DataFrame consists of user-specific data related to the Aave protocol, with 11,899 entries across 18 different columns. Here is the detailed description of each column:
user (object): The address of the user involved in the transactions.
cohort_ID (period[M]): A cohort identifier for grouping related transactions.
first_transaction_date (datetime64[ns, UTC]): The date of the user's first transaction.
is_aave_v2_user (bool): A flag indicating whether the user has interacted with Aave's V2 protocol.
total_transactions (int64): The total number of transactions performed by the user.
total_usd_transacted (float64): The total amount in USD transacted by the user.
avg_transaction_value (float64): The average value of the user's transactions in USD.
total_usd_flashloans (float64): The total USD value of flash loans taken by the user.
number_of_deposits (int64): The total number of deposits made by the user.
number_of_borrows (int64): The total number of borrows made by the user.
number_of_repays (int64): The total number of repayments made by the user.
number_of_liquidations (int64): The total number of liquidations executed by the user.
number_of_withdraws (int64): The total number of withdrawals made by the user.
average_ltv (float64): The average loan-to-value ratio for the user (6012 non-null entries).
average_ltv_borrow (float64): The average loan-to-value ratio for borrows specifically (6020 non-null entries).
number_of_unique_symbols_transacted (int64): The number of unique symbols the user has transacted with.
trading_assets_category_type (object): The category type of assets traded by the user.
number_of_unique_months_active (int64): The number of unique months the user has been active.
The data types include bool, datetime64[ns, UTC], float64, int64, object, and period[M], with a total memory usage of 1.6+ MB.
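For orientation, here is a small synthetic sketch of the dtypes and a typical aggregation over the described columns; the values and the subset of columns are made up for illustration and are not part of the dataset:
```python
import pandas as pd

# Tiny synthetic frame with a subset of the described columns.
df = pd.DataFrame(
    {
        "user": ["0xabc", "0xdef", "0x123"],
        "cohort_ID": pd.PeriodIndex(["2021-01", "2021-01", "2021-03"], freq="M"),
        "is_aave_v2_user": [True, False, True],
        "total_transactions": [12, 3, 7],
        "total_usd_transacted": [1500.0, 200.0, 980.5],
    }
)
print(df.dtypes)  # object, period[M], bool, int64, float64 as in the description

# Example aggregation: total USD volume per cohort.
print(df.groupby("cohort_ID")["total_usd_transacted"].sum())
```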
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. The giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness training and reintroduction are important components of giant panda conservation. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of giant pandas undergoing wildness training. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of pandas undergoing wildness training. The whole-genome DNA methylation sequencing results showed that the genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of giant pandas undergoing wildness training by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differential methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation levels at promoter regions and high expression, such as CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1, were important in environmental adaptation for giant pandas undergoing wildness training. The methylation and expression patterns of these genes indicated that giant pandas undergoing wildness training have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by promoter methylation, which is negatively related to gene expression. We are the first to describe the DNA methylation profile of giant panda blood tissue, and our results indicated that methylation modification is involved in the adaptation of captive giant pandas undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description 🙅♂️🤖
Bank of Ghana historical and real-time treasury bills data.
Data Format
{ "issue_date": "...", "tender": "...", "security_type": "...", "discount_rate": "...", "interest_rate": "..." }
Load Dataset
pip install datasets
from datasets import load_dataset
import pandas as pd

treasury = load_dataset("worldboss/bank-of-ghana-treasury-bills", split="train")
pd.DataFrame(treasury).head()
… See the full description on the dataset page: https://huggingface.co/datasets/worldboss/bank-of-ghana-treasury-bills.
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Preprocessed from https://huggingface.co/datasets/lorenzoscottb/PLANE-ood/
df = pd.read_json('https://huggingface.co/datasets/lorenzoscottb/PLANE-ood/resolve/main/PLANE_trntst-OoV_inftype-all.json')
f = lambda df: pd.DataFrame(list(zip(*[df[c] for c in df.index])), columns=df.index)
ds = DatasetDict()
for split in ['train','test']:
    dfs = pd.concat([f(df[c]) for c in df.columns if split in c.lower()]).reset_index(drop=True)
    dfs['label'] = dfs['label'].map(lambda x: {1:'entailment'… See the full description on the dataset page: https://huggingface.co/datasets/tasksource/PLANE-ood.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset has been collected from multiple sources provided by MVCR on their websites and contains daily summarized statistics as well as detailed statistics down to the age & sex level.
What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.
Date - Calendar date when data were collected
Daily tested - Sum of tests performed
Daily infected - Sum of confirmed cases that were positive
Daily cured - Sum of cured people who no longer have Covid-19
Daily deaths - Sum of people who died of Covid-19
Daily cum tested - Cumulative sum of tests performed
Daily cum infected - Cumulative sum of confirmed cases that were positive
Daily cum cured - Cumulative sum of cured people who no longer have Covid-19
Daily cum deaths - Cumulative sum of people who died of Covid-19
Region - Region of the Czech Republic
Sub-Region - Sub-region of the Czech Republic
Region accessories qty - Quantity of health care accessories delivered to the region over all time
Age - Age of the person
Sex - Sex of the person
Infected - Sum of infected people for a specific date, region, sub-region, age and sex
Cured - Sum of cured people for a specific date, region, sub-region, age and sex
Death - Sum of people who died of Covid-19 for a specific date, region, sub-region, age and sex
The dataset contains data at different levels of granularity. Make sure you do not mix different granularities. Let's suppose you have loaded the data into a pandas dataframe called df.
df_daily = df.groupby(['date']).max()[['daily_tested','daily_infected','daily_cured','daily_deaths','daily_cum_tested','daily_cum_infected','daily_cum_cured','daily_cum_deaths']].reset_index()
df_region = df[df['region'] != ''].groupby(['region']).agg(
region_accessories_qty=pd.NamedAgg(column='region_accessories_qty', aggfunc='max'),
infected=pd.NamedAgg(column='infected', aggfunc='sum'),
cured=pd.NamedAgg(column='cured', aggfunc='sum'),
death=pd.NamedAgg(column='death', aggfunc='sum')
).reset_index()
df_detail = df[['date','region','sub_region','age','sex','infected','cured','death']].reset_index(drop=True)
Thanks to the websites of MVCR for sharing such great information.
Can you see a relation between the health care accessories delivered to a region and the number of cured/infected people in that region? Why does the Czech Republic belong among the relatively safe countries when talking about the Covid-19 pandemic? Can you find out how the evolution of the pandemic in the Czech Republic differs from that in surrounding countries, like Germany or Slovakia?
https://creativecommons.org/publicdomain/zero/1.0/
This dataset has been collected from multiple sources provided by MVCR on their websites and contains daily summarized statistics as well as detailed statistics down to the age & sex level.
What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.
Date - Calendar date when data were collected
Daily tested - Sum of tests performed
Daily infected - Sum of confirmed cases that were positive
Daily cured - Sum of cured people who no longer have Covid-19
Daily deaths - Sum of people who died of Covid-19
Daily cum tested - Cumulative sum of tests performed
Daily cum infected - Cumulative sum of confirmed cases that were positive
Daily cum cured - Cumulative sum of cured people who no longer have Covid-19
Daily cum deaths - Cumulative sum of people who died of Covid-19
Region - Region of the Czech Republic
Sub-Region - Sub-region of the Czech Republic
Region accessories qty - Quantity of health care accessories delivered to the region over all time
Age - Age of the person
Sex - Sex of the person
Infected - Sum of infected people for a specific date, region, sub-region, age and sex
Cured - Sum of cured people for a specific date, region, sub-region, age and sex
Death - Sum of people who died of Covid-19 for a specific date, region, sub-region, age and sex
Infected abroad - Identifies whether the person was infected with Covid-19 in the Czech Republic or abroad
Infected in country - Code of the country from which the person came (origin country of the Covid-19 infection)
The dataset contains data at different levels of granularity. Make sure you do not mix different granularities. Let's suppose you have loaded the data into a pandas dataframe called df.
df_daily = df.groupby(['date']).max()[['daily_tested','daily_infected','daily_cured','daily_deaths','daily_cum_tested','daily_cum_infected','daily_cum_cured','daily_cum_deaths']].reset_index()
df_region = df[df['region'] != ''].groupby(['region']).agg(
region_accessories_qty=pd.NamedAgg(column='region_accessories_qty', aggfunc='max'),
infected=pd.NamedAgg(column='infected', aggfunc='sum'),
cured=pd.NamedAgg(column='cured', aggfunc='sum'),
death=pd.NamedAgg(column='death', aggfunc='sum')
).reset_index()
df_detail = df[['date','region','sub_region','age','sex','infected','cured','death','infected_abroad','infected_in_country']].reset_index(drop=True)
Thanks to the websites of MVCR for sharing such great information.
Can you see a relation between the health care accessories delivered to a region and the number of cured/infected people in that region? Why does the Czech Republic belong among the relatively safe countries when talking about the Covid-19 pandemic? Can you find out how the evolution of the pandemic in the Czech Republic differs from that in surrounding countries, like Germany or Slovakia?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Weather data collected in the Otemma forefield (Switzerland) from 14 July 2019 to 18 November 2021.
Data were collected by the research teams of Bettina Schaefli (2) and Stuart N. Lane (1).
1 Institute of Earth Surface Dynamics (IDYST), University of Lausanne, 1015 Lausanne, Switzerland
2 Institute of Geography (GIUB), University of Bern, 3012 Bern, Switzerland
For further information, please contact:
tom.muller.1@unil.ch
bettina.schaefli@giub.unibe.ch
Description of data : WeatherData.csv
Time span of data : 14 July 2019 to 18 November 2021
Time step : homogenized 10-minute-averaged data
Location of data (coordinates in SWISS LV95, EPSG:2056) :
- Glacier snout Station : 2598615 / 1087375
- Glacier center Station : 2600495 / 1088631
- Floodplain Station : 2598096 / 1087087
STRUCTURE OF DATA : tidy dataframe with the following headers :
- date : local date (UTC+01 with daylight saving time)
- variable : parameter of interest, with the following classes :
Air_humidity : Air humidity in percent of air saturation [%]
Air_temperature : Air temperature in [°C]
Atm_pressure : Atmospheric pressure in [hPa]
Incoming_radiation : Incoming shortwave radiation in [W/m2]
Precipitation : Liquid precipitation measured [mm]
- name : location of data (see coordinates above)
- dateUTC : date with UTC timezone
Device used for data acquisition :
- Air_humidity/Air_temperature/Atm_pressure : Decagon VP-4
- Incoming_radiation : Apogee Instruments SP-11
- Precipitation : Double tipping buckets rain gauge from Davis Instruments (resolution 0.2 mm)
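A minimal sketch for reading this tidy dataframe with pandas and pivoting it to one column per station and variable; the name of the value column ("value") and the exact station labels are assumptions (check the CSV header), everything else follows the structure described above:
```python
import pandas as pd

# Load the tidy weather data; 'date' and 'dateUTC' are described above,
# the value column name ('value') is an assumption.
df = pd.read_csv("WeatherData.csv", parse_dates=["date", "dateUTC"])

# One column per (station, variable) pair, indexed by UTC time.
wide = df.pivot_table(index="dateUTC", columns=["name", "variable"], values="value")
print(wide.head())
```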
Description of data : RainComposite_Otemma_Arolla.csv
Time span of data : 01 July 2019 to 18 November 2021
Time step : homogenized 10-minute-averaged data
The dataset compiles the measured rain during summer at the closest weather station (Glacier snout).
For the winter period (and few gaps during the summer), the solid precipitations (snow) from the closest MeteoSwiss weather stations (SwissMetNet) were used.
Gaps were filled with 1) MeteoSwiss Otemma and, if data were still missing, with 2) MeteoSwiss Arolla.
Location of data (coordinates in SWISS LV95, EPSG:2056) :
- Glacier snout Station : 2598615 / 1087375
- Otemma camp Station : 2597508 / 1086653
- MeteoSwiss Station : 2596476 / 1085864
- MeteoSwiss Arolla : 2603507 / 1095832
STRUCTURE OF DATA : tidy dataframe with the following headers :
- date : local date (UTC+01 with dst)
- variable : parameter of interest
Precipitation : Liquid and solid precipitation [mm]. Composite dataset composed of melted snow (snow Water Equivalent, in mm, from MeteoSwiss station) and Rain (in mm from Glacier station).
- Location : location of data (see above)
- dateUTC : date in UTC timezone
Device used for data acquisition :
- Glacier snout Station : Double tipping buckets rain gauge from Davis Instruments (resolution 0.2 mm)
- Otemma camp Station : Double tipping buckets rain gauge, Spectrum WatchDog 1120 (resolution 0.25 mm)
- MeteoSwiss : see SwissMetNet project
Description of data : Otemma_weather_Plot_alldata.html
An interactive plot generated with Python Plotly (open in a web browser) containing all of the data described above.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of TFRecord files made from the Plant Pathology 2021 original competition data. Changes:
* the labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
* images were scaled to 512x512
* 77 duplicate images having different labels were removed (see the context in this notebook)
* samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
* images were heavily augmented with the albumentations library (for raw images see this dataset)
* each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)
I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.
For a complete example see my TPU Training Notebook
train.csv
folds.csv
fold_0:fold_4 folders containing 64 .tfrec files each, with the feature map shown below:
feature_map = {
'image': tf.io.FixedLenFeature([], tf.string),
'name': tf.io.FixedLenFeature([], tf.string),
'complex': tf.io.FixedLenFeature([], tf.int64),
'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
'healthy': tf.io.FixedLenFeature([], tf.int64),
'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
'rust': tf.io.FixedLenFeature([], tf.int64),
'scab': tf.io.FixedLenFeature([], tf.int64)}
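A hedged sketch of how these records could be read with tf.data, using the feature_map defined above; the assumption that the image bytes are JPEG-encoded, and the batch size, are mine and not stated in the dataset description:
```python
import tensorflow as tf

LABELS = ['complex', 'frog_eye_leaf_spot', 'healthy', 'powdery_mildew', 'rust', 'scab']

def parse_example(serialized):
    # Decode one serialized example using the feature_map defined above.
    example = tf.io.parse_single_example(serialized, feature_map)
    image = tf.io.decode_jpeg(example['image'], channels=3)  # assumes JPEG-encoded bytes
    labels = tf.stack([tf.cast(example[name], tf.float32) for name in LABELS])
    return image, labels

files = tf.io.gfile.glob('fold_0/*.tfrec')
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32))
```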
### Acknowledgements

UPDATED on October 15 2020: After some mistakes in some of the data were found, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

This data set, under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found; it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc.), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.
The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add. Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper. WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
The dataset is an excerpt of the validation dataset used in:
Ruiz-Arias JA, Gueymard CA. Review and performance benchmarking of 1-min solar irradiance components separation methods: The critical role of dynamically-constrained sky conditions. Submitted for publication to Renewable and Sustainable Energy Reviews.
and it is ready to use in the Python package splitting_models developed during that research. See the documentation in the Python package for usage details. Below, there is a detailed description of the dataset.
The data is in a single parquet file that contains 1-min time series of solar geometry, clear-sky solar irradiance simulations, solar irradiance observations and CAELUS sky types for 5 BSRN sites, one per primary Köppen-Geiger climate, namely: Minamitorishima (mnm), JP, for equatorial climate; Alice Springs (asp), AU, for dry climate; Carpentras (car), FR, for temperate climate; Bondville (bon), US, for continental climate; and Sonnblick (son), AT, for cold/polar/snow climate. It includes one calendar year per site. The BSRN data is publicly available. See download instructions in https://bsrn.awi.de/data.
The specific variables included in the dataset are:
climate: primary Köppen-Geiger climate. Values are: A (equatorial), B (dry), C (temperate), D (continental) and E (polar/snow).
longitude: longitude, in degrees east.
latitude: latitude, in degrees north.
sza: solar zenith angle, in degrees.
eth: extraterrestrial solar irradiance (i.e., top of atmosphere solar irradiance), in W/m2.
ghics: clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.
difcs: clear-sky diffuse solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.
ghicda: clean-and-dry clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere, prescribing zero aerosols and zero precipitable water.
ghi: observed global horizontal irradiance, in W/m2.
dif: observed diffuse irradiance, in W/m2.
sky_type: CAELUS sky type. Values are: 1 (unknown), 2 (overcast), 3 (thick clouds), 4 (scattered clouds), 5 (thin clouds), 6 (cloudless) and 7 (cloud enhancement).
The dataset can be easily loaded in a Python Pandas DataFrame as follows:
import pandas as pd
data = pd.read_parquet("<path to the dataset parquet file>")  # substitute the actual file path
The dataframe has a multi-index with two levels: times_utc and site. The former are the UTC timestamps at the center of each 1-min interval. The latter is each site's label.
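Continuing from the loading snippet above, here is a small sketch of how one site can be pulled out of that multi-index and a derived quantity computed; the level name and site label come from the description above, while computing a diffuse fraction is just an illustration:
```python
# Select the temperate-climate site (Carpentras, label 'car') and compute the
# 1-min diffuse fraction from the observed irradiance columns described above.
car = data.xs("car", level="site")
diffuse_fraction = car["dif"] / car["ghi"]
print(diffuse_fraction.describe())
```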
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
DArk Matter SPIkes (DAMSPI) is a fully Python-based software package for the analysis of dark matter spikes around Intermediate Mass Black Holes (IMBHs) in the Milky Way. It allows the extraction of an IMBH catalogue and the corresponding dark matter spike parameters from the EAGLE simulations in order to probe a potential gamma-ray signal from dark matter self-annihilation.
The dataset contains the IMBH catalogue including, among others, the coordinates, mass, formation redshift and spike parameters for each individual IMBH. Each column of the catalogue is described in detail in J. Aschersleben et al. (2024). We also provide separate files for which we calculated the gamma-ray fluxes for different dark matter masses and annihilation cross sections. Lastly, we provide a catalogue of our selection of Milky Way like galaxies within EAGLE. The columns of these files are also described in J. Aschersleben et al. (2024).
The source code used to extract this dataset is publicly available here:
https://doi.org/10.5281/zenodo.11488472
Description of the data files
The imbh_catalogue/imbh/ directory contains the following files:
catalogue_nfw.h5
catalogue_cored_gamma_0p3.h5
catalogue_cored_gamma_0p9.h5
catalogue_cored_gamma_free.h5
They contain the IMBH catalogues, including the coordinates and dark matter spike parameters, calculated assuming 1) the NFW profile, 2) a cored profile with a fixed core index of 0.3, 3) a cored profile with a fixed core index of 0.9, and 4) a cored profile with the core index as a free fitting parameter.
The imbh_catalogue/flux/ directory contains the files with the calculated gamma-ray fluxes for different dark matter masses and annihilation cross sections.
The imbh_catalogue/galaxy/ directory contains the mw_galaxies_catalogue_nfw.h5 file which contains our selection of Milky Way-like galaxies within EAGLE.
The HDF files can be opened in Python with:
import pandas as pd
file_path = "catalogue_nfw.h5"  # or any of the catalogue files listed above
df = pd.read_hdf(file_path, key="table")
print(df.head())
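As a quick check after loading, one can inspect the available columns; the column label "mass" used in the filter below is an assumption based on the description above (the exact labels are documented in J. Aschersleben et al. (2024)):
```python
# List the available columns and filter on one of them (column name assumed).
print(df.columns.tolist())
massive_imbhs = df[df["mass"] > df["mass"].median()]  # e.g. the heavier half of the IMBHs
print(len(massive_imbhs))
```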
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the relevant data for the algorithms described in the paper "Irradiance and cloud optical properties from solar photovoltaic systems", which were developed within the framework of the MetPVNet project.
Input data:
COSMO weather model data (DWD) as NetCDF files (cosmo_d2_2018(9).tar.gz)
COSMO atmospheres for libRadtran (cosmo_atmosphere_libradtran_input.tar.gz)
COSMO surface data for calibration (cosmo_pvcal_output.tar.gz)
Aeronet data as text files (MetPVNet_Aeronet_Input_Data.zip)
Measured data from the MetPVNet measurement campaigns as text files (MetPVNet_Messkampagne_2018(9).tar.gz)
PV power data
Horizontal and tilted irradiance from pyranometers
Longwave irradiance from pyrgeometer
MYSTIC-based lookup table for translating tilted to horizontal irradiance (gti2ghi_lut_v1.nc)
Output data:
Global tilted irradiance (GTI) inferred from PV power plants (with calibration parameters in comments)
Linear temperature model: MetPVNet_gti_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_gti_cf_inversion_results_faiman.tar.gz
Global horizontal irradiance (GHI) inferred from PV power plants
Linear temperature model: MetPVNet_ghi_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_results_faiman.tar.gz
Combined GHI averaged to 60 minutes and compared with COSMO data
Linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_faiman.tar.gz
Cloud optical depth inferred from PV power plants
Linear temperature model: MetPVNet_cod_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_cf_inversion_results_faiman.tar.gz
Combined COD averaged to 60 minutes and compared with COSMO and APOLLO_NG data
Linear temperature model: MetPVNet_cod_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_inversion_combo_60min_results_faiman.tar.gz
Validation data:
COSMO cloud optical depth (cosmo_cod_output.tar.gz)
APOLLO_NG cloud optical depth (MetPVNet_apng_extract_all_stations_2018(9).tar.gz)
COSMO irradiance data for validation (cosmo_irradiance_output.tar.gz)
CAMS irradiance data for validation (CAMS_irradiation_detailed_MetPVNet_MK_2018(9).zip)
How to import results:
The result files are stored as text files (".dat") using pandas multi-index columns. In order to import the data into a pandas DataFrame, use the following lines of code (replace [filename] with the relevant file name):
import pandas as pd

data = pd.read_csv("[filename].dat", comment='#', header=[0,1], delimiter=';', index_col=0, parse_dates=True)
This gives a multi-index DataFrame indexed by timestamp; the first column level corresponds to the measured variable and the second to the relevant sensor.
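A hypothetical sketch of working with those two column levels; the variable label "GTI" below is a placeholder (inspect data.columns for the actual labels), while averaging to 60 minutes mirrors the combined products listed above:
```python
# Inspect the (variable, sensor) column pairs, then select one variable across
# all sensors and average it to 60 minutes. "GTI" is a placeholder label.
print(data.columns.tolist())
one_variable = data.xs("GTI", axis=1, level=0)
hourly = one_variable.resample("60min").mean()
```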
Note:
The output data has been updated to match the latest version of the paper, whereas the input and validation data remains the same as in Version 1.0.0
https://choosealicense.com/licenses/undefined/
Port of the compas-recidivism dataset from propublica (github here). See details there and use carefully, as there are serious known social impacts and biases present in this dataset. Basic preprocessing done by the imodels team in this notebook. The target is the binary outcome is_recid.
Sample usage
Load the data:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("imodels/compas-recidivism")
df = pd.DataFrame(dataset['train'])
X = df.drop(columns=['is_recid'])
y = df['is_recid']
… See the full description on the dataset page: https://huggingface.co/datasets/imodels/compas-recidivism.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.
One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.
This dataset represents topology optimization problems and solutions on the bases of voxels. We define all spatially varying quantities via the voxels' centers -- rather than via the vertices or surfaces of the voxels.
In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].
Any of these files with the index i can be imported using pandas by executing:
import pandas as pd
directory = ...
file_path = f'{directory}/{i}.csv'
column_names = ['x', 'y', 'z', 'design_space','dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
data = pd.read_csv(file_path, names=column_names)
From this pandas dataframe one can extract the torch tensors of the forces F, the Dirichlet conditions ω_Dirichlet, and the design space information ω_design using the following functions:
import torch
def get_shape_and_voxels(data):
shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
vox_x = data['x'].values
vox_y = data['y'].values
vox_z = data['z'].values
voxels = [vox_x, vox_y, vox_z]
return shape, voxels
def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
F = torch.zeros(3, *shape, dtype=torch.float32)
F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
ω_design = torch.zeros(1, *shape, dtype=int)
ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
return F, ω_Dirichlet, ω_design
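A short usage sketch tying the two helpers together for one sample; the sample index i = 0 is an arbitrary choice, and building the density tensor by the same voxel-indexing pattern as above is illustrative only:
```python
# Read one sample and build the tensors (assumes pd, torch, directory and
# column_names from the snippets above).
i = 0
data = pd.read_csv(f'{directory}/{i}.csv', names=column_names)
shape, voxels = get_shape_and_voxels(data)
F, ω_Dirichlet, ω_design = get_forces_boundary_conditions_and_design_space(data, shape, voxels)

# The binarized SIMP density can be arranged on the same voxel grid.
density = torch.zeros(1, *shape, dtype=torch.float32)
density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['density'].values, dtype=torch.float32)
```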
The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].
Analogously to above, one can import any {i}_info.csv file by executing:
file_path = f'{directory}/{i}_info.csv'
data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
data_info = pd.read_csv(file_path, names=data_info_column_names)
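And a small follow-up for the scalar parameters; interpreting E, ν and σ_ys as the Young's modulus, Poisson's ratio and yield stress is a reading of standard notation, not stated explicitly above:
```python
# The info file has a single row, so .iloc[0] gives the scalar parameters.
E = float(data_info['E'].iloc[0])            # presumably Young's modulus
nu = float(data_info['ν'].iloc[0])           # presumably Poisson's ratio
sigma_ys = float(data_info['σ_ys'].iloc[0])  # presumably yield stress
vox_size = float(data_info['vox_size'].iloc[0])
```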