Dataset Card for "Magicoder-Evol-Instruct-110K-python"
from datasets import load_dataset

dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train")  # Replace with your dataset and split

def contains_python(entry):
    # Returns True if any message in the conversation mentions "python"
    for c in entry["messages"]:
        if "python" in c['content'].lower():
            return True
    return False
    # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search
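A minimal sketch of how this predicate could be used to derive the Python-only subset (the `filter` call and the variable name below are illustrative assumptions, not part of the original card):

python_subset = dataset.filter(contains_python)  # keep only conversations that mention Python
print(python_subset)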
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - a yield stress [Pa]
h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
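For instance, a concrete call (the root directory and subset name here are illustrative assumptions) could look like:

from dl4to.datasets import SELTODataset

dataset_train = SELTODataset(root='SELTO', name='disc_simple', train=True)   # training split of "disc simple"
dataset_val = SELTODataset(root='SELTO', name='disc_simple', train=False)    # corresponding validation split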
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # The voxel grid shape is inferred from the largest (x, y, z) indices.
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
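A short usage sketch combining the reading and conversion steps above (the sample index `i = 0` is an illustrative assumption):

i = 0  # illustrative sample index
df = pd.read_csv(f'{root}/{i}.csv', names=columns)
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, F.shape)  # (1, *grid_shape) and (3, *grid_shape), respectively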
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Python CodeSearch Dataset (Shuu12121/python-treesitter-filtered-datasetsV2)
Dataset Description
This dataset contains Python functions paired with their documentation strings (docstrings), extracted from open-source Python repositories on GitHub. It is formatted similarly to the CodeSearchNet challenge dataset. Each entry includes:
code: The source code of a python function or method. docstring: The docstring or Javadoc associated with the function/method. func_name: The… See the full description on the dataset page: https://huggingface.co/datasets/Shuu12121/python-treesitter-filtered-datasetsV2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Site Description:
In this dataset, there are seventeen production crop fields in Bulgaria where winter rapeseed and wheat were grown, and two research fields in France where winter wheat – rapeseed – barley – sunflower and winter wheat – irrigated maize crop rotations are used. The full description of those fields is in the database "In-situ crop phenology dataset from sites in Bulgaria and France" (doi.org/10.5281/zenodo.7875440).
Methodology and Data Description:
Remote sensing data is extracted from Sentinel-2 tiles 35TNJ for Bulgarian sites and 31TCJ for French sites on the day of the overpass since September 2015 for Sentinel-2 derived vegetation indices and since October 2016 for HR-VPP products. To suppress spectral mixing effects at the parcel boundaries, as highlighted by Meier et al., 2020, the values from all datasets were subgrouped per field and then aggregated to a single median value for further analysis.
Sentinel-2 data was downloaded for all test sites from CREODIAS (https://creodias.eu/) in L2A processing level using a maximum scene-wide cloudy cover threshold of 75%. Scenes before 2017 were available in L1C processing level only. Scenes in L1C processing level were corrected for atmospheric effects after downloading using Sen2Cor (v2.9) with default settings. This was the same version used for the L2A scenes obtained directly from CREODIAS.
Next, the data was extracted from the Sentinel-2 scenes for each field parcel where only SCL classes 4 (vegetation) and 5 (bare soil) pixels were kept. We resampled the 20m band B8A to match the spatial resolution of the green and red band (10m) using nearest neighbor interpolation. The entire image processing chain was carried out using the open-source Python Earth Observation Data Analysis Library (EOdal) (Graf et al., 2022).
Apart from the widely used Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), we included two recently proposed indices that were reported to have a higher correlation with photosynthesis and drought response of vegetation: These were the Near-Infrared Reflection of Vegetation (NIRv) (Badgley et al., 2017) and Kernel NDVI (kNDVI) (Camps-Valls et al., 2021). We calculated the vegetation indices in two different ways:
First, we used B08 as near-infrared (NIR) band which comes in a native spatial resolution of 10 m. B08 (central wavelength 833 nm) has a relatively coarse spectral resolution with a bandwidth of 106 nm.
Second, we used B8A which is available at 20 m spatial resolution. B8A differs from B08 in its central wavelength (864 nm) and has a narrower bandwidth (21 nm or 22 nm in the case of Sentinel-2A and 2B, respectively) compared to B08.
The High Resolution Vegetation Phenology and Productivity (HR-VPP) dataset from the Copernicus Land Monitoring Service (CLMS) comprises three sets of 10-m Sentinel-2 products: vegetation indices, vegetation phenology and productivity parameters, and seasonal trajectories (Tian et al., 2021). Both vegetation indices, the Normalized Difference Vegetation Index (NDVI) and the Plant Phenology Index (PPI), and both plant parameters, the Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) and the Leaf Area Index (LAI), were computed for the time of the Sentinel-2 overpass by the data provider.
NDVI is computed directly from B04 and B08, and PPI is computed using the Difference Vegetation Index (DVI = B08 - B04) and its seasonal maximum value per pixel. FAPAR and LAI are retrieved from B03, B04 and B08 with a neural network trained on PROSAIL model simulations. The dataset has a quality flag product (QFLAG2), a 16-bit flag that extends the scene classification band (SCL) of the Sentinel-2 Level-2 products. A “medium” filter was used to mask out QFLAG2 values from 2 to 1022, leaving land pixels (bit 1) within or outside cloud proximity (bits 11 and 13) or cloud shadow proximity (bits 12 and 14).
The HR-VPP daily raw vegetation indices products are described in detail in the user manual (Smets et al., 2022) and the computations details of PPI are given by Jin and Eklundh (2014). Seasonal trajectories refer to the 10-daily smoothed time-series of PPI used for vegetation phenology and productivity parameters retrieval with TIMESAT (Jönsson and Eklundh 2002, 2004).
HR-VPP data was downloaded through the WEkEO Copernicus Data and Information Access Services (DIAS) system with a Python 3.8.10 harmonized data access (HDA) API 0.2.1. Zonal statistics [’min’, ’max’, ’mean’, ’median’, ’count’, ’std’, ’majority’] were computed on non-masked pixel values within field boundaries with rasterstats Python package 0.17.00.
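As a rough illustration of this zonal-statistics step, a minimal sketch using the rasterstats package might look as follows (the file names for the field boundaries and the HR-VPP raster are assumptions for illustration):

from rasterstats import zonal_stats

# compute per-field statistics of a single-date HR-VPP raster within the field boundary polygons
stats = zonal_stats(
    "field_boundaries.shp",        # assumed vector file with field parcel polygons
    "HRVPP_PPI_20200615.tif",      # assumed single-date PPI raster
    stats=["min", "max", "mean", "median", "count", "std", "majority"],
)
print(stats[0])  # dictionary of statistics for the first field parcel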
The Start of season date (SOSD), end of season date (EOSD) and length of seasons (LENGTH) were extracted from the annual Vegetation Phenology and Productivity Parameters (VPP) dataset as an additional source for comparison. These data are a product of the Vegetation Phenology and Productivity Parameters, see (https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity/vegetation-phenology-and-productivity) for detailed information.
File Description:
4 datasets:
1_senseco_data_S2_B08_Bulgaria_France; 1_senseco_data_S2_B8A_Bulgaria_France; 1_senseco_data_HR_VPP_Bulgaria_France; 1_senseco_data_phenology_VPP_Bulgaria_France
3 metadata:
2_senseco_metadata_S2_B08_B8A_Bulgaria_France; 2_senseco_metadata_HR_VPP_Bulgaria_France; 2_senseco_metadata_phenology_VPP_Bulgaria_France
The dataset files “1_senseco_data_S2_B08_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” concern all vegetation indices (EVI, NDVI, kNDVI, NIRv) data values and related information, and the metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France” describes all the existing variables. Both “1_senseco_data_S2_B08_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” have the same column variable names and for that reason they share the same metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France”.
The dataset file “1_senseco_data_HR_VPP_Bulgaria_France” concerns vegetation indices (NDVI, PPI) and plant parameters (LAI, FAPAR) data values and related information, and the metadata file “2_senseco_metadata_HR_VPP_Bulgaria_France” describes all the existing variables.
The dataset file “1_senseco_data_phenology_VPP_Bulgaria_France” concerns the vegetation phenology and productivity parameters (LENGTH, SOSD, EOSD) values and related information, and the metadata file “2_senseco_metadata_phenology_VPP_Bulgaria_France” describes all the existing variables.
Bibliography
G. Badgley, C.B. Field, J.A. Berry, Canopy near-infrared reflectance and terrestrial photosynthesis, Sci. Adv. 3 (2017) e1602244. https://doi.org/10.1126/sciadv.1602244.
G. Camps-Valls, M. Campos-Taberner, Á. Moreno-Martínez, S. Walther, G. Duveiller, A. Cescatti, M.D. Mahecha, J. Muñoz-Marí, F.J. García-Haro, L. Guanter, M. Jung, J.A. Gamon, M. Reichstein, S.W. Running, A unified vegetation index for quantifying the terrestrial biosphere, Sci. Adv. 7 (2021) eabc7447. https://doi.org/10.1126/sciadv.abc7447.
L.V. Graf, G. Perich, H. Aasen, EOdal: An open-source Python package for large-scale agroecological research using Earth Observation and gridded environmental data, Comput. Electron. Agric. 203 (2022) 107487. https://doi.org/10.1016/j.compag.2022.107487.
H. Jin, L. Eklundh, A physically based vegetation index for improved monitoring of plant phenology, Remote Sens. Environ. 152 (2014) 512–525. https://doi.org/10.1016/j.rse.2014.07.010.
P. Jonsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (2002) 1824–1832. https://doi.org/10.1109/TGRS.2002.802519.
P. Jönsson, L. Eklundh, TIMESAT—a program for analyzing time-series of satellite sensor data, Comput. Geosci. 30 (2004) 833–845. https://doi.org/10.1016/j.cageo.2004.05.006.
J. Meier, W. Mauser, T. Hank, H. Bach, Assessments on the impact of high-resolution-sensor pixel sizes for common agricultural policy and smart farming services in European regions, Comput. Electron. Agric. 169 (2020) 105205. https://doi.org/10.1016/j.compag.2019.105205.
B. Smets, Z. Cai, L. Eklund, F. Tian, K. Bonte, R. Van Hoost, R. Van De Kerchove, S. Adriaensen, B. De Roo, T. Jacobs, F. Camacho, J. Sánchez-Zapero, S. Else, H. Scheifinger, K. Hufkens, P. Jönsson, HR-VPP Product User Manual Vegetation Indices, 2022.
F. Tian, Z. Cai, H. Jin, K. Hufkens, H. Scheifinger, T. Tagesson, B. Smets, R. Van Hoolst, K. Bonte, E. Ivits, X. Tong, J. Ardö, L. Eklundh, Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe, Remote Sens. Environ. 260 (2021) 112456. https://doi.org/10.1016/j.rse.2021.112456.
This data release supports the paper titled, “Tungsten skarn potential of the Yukon-Tanana Uplands, Eastern Alaska, USA-A mineral resource assessment”, published via open-access license in the Journal of Geochemical Exploration and available at: https://doi.org/10.1016/j.gexplo.2020.106700. The data release includes GIS data that map potential for tungsten skarn mineralization in permissive tracts in the Yukon-Tanana Uplands, Eastern Alaska, along with tables listing keywords and procedures used to produce the permissive tracts and score them for mineral potential. Supplementary Data part A lists keywords used to extract permissive rock types from the Geologic Map of Alaska (Wilson et al., 2015) to generate the permissive tract for tungsten skarn. Supplementary Data part B describes the tract polishing procedures. Supplementary Data part C lists the parameters for scoring tungsten skarn mineralization potential within the permissive tract features. The GIS data are encapsulated in a file geodatabase called AK_Wskarn_tract.gdb and are also available in the shapefile and KML formats. The geodatabase contains three datasets. The polygon feature class “primary_attributes” contains the scored tungsten skarn permissive tract subdivided by National Hydrography Dataset HUC12 drainages. A related table, “qualitative_assessment” contains detailed scoring information for each feature. The point feature class “mineral_sites_ranked” contains W-bearing mineral sites pulled from the Alaska Resource Data File with additional fields added for this study. The GIS data folder also includes the Python script used to score potential. The datasets and methods are described in detail in the accompanying paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is split into two parts according to the two experiments described within the article. The dataset includes movies and python codes for classifying emotions from experiment 1, and EEG and ERP measurements from experiment 2 along with associated code for analyzing those data.
Experiment 1 tests the validity of the SEED dataset collated by Zheng, Dong, & Lu (2014) and, subsequently, our own stimuli. The objective was to test whether previous literature using datasets such as the aforementioned one by Zheng et al. is classifying between emotions based on emotion-related signals of interest or on non-emotional ‘noise’.
Experiment 2 used stimuli that have been well-validated within the psychological literature as reliably evoking specific embodiments of emotions within the viewer, namely the NimStim face and ADFES-BIV datasets, with the objective of classifying a person's emotional status using EEG.
All data was processed and all analyses were run in MATLAB or Python. All datasets used are included within the folders, accompanied by the MATLAB or Python scripts for collating separable matrices and running the analyses.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset for event encoded analog EEG signals for detection of Epileptic seizures
This dataset contains events that are encoded from the analog signals recorded during pre-surgical evaluations of patients at the Sleep-Wake-Epilepsy-Center (SWEC) of the University Department of Neurology at the Inselspital Bern. The analog signals are sourced from the SWEC-ETHZ iEEG Database
This database contains event streams for 10 seizures recorded from 5 patients, generated by the DYnamic Neuromorphic Asynchronous Processor (DYNAP-SE2) to demonstrate a proof-of-concept of encoding seizures with network synchronization. The pipeline consists of two parts: (I) an Analog Front End (AFE) and (II) a spiking neural network (SNN) termed the "Non-Local Non-Global" (NLNG) network.
In the first part of the pipeline, the digitally recorded signals from SWEC-ETHZ iEEG Database are converted to analog signals via an 18-bit Digital-to-Analog converter (DAC) and then amplified and encoded into events by an Asynchronous Delta Modulator (ADM). Then in the second part, the encoded event streams are fed into the SNN that extracts the features of the epileptic seizure by extracting the partial synchronous patterns intrinsic to the seizure dynamics.
Details about the neuromorphic processing pipeline and the encoding process are included in a manuscript under review. The preprint is available in bioRxiv
Installation
The installation requires Python >= 3.x and conda (or py-venv). Users can then install the requirements inside a conda environment using
conda env create -f requirements.txt -n sez
Once created, the conda environment can be activated with `conda activate sez`.
The main files in the database are described in the hierarchy below.
EventSezDataset/
├─ data/
│  ├─ PxSx/
│  │  ├─ Patx_Sz_x_CHx.csv
├─ LSVM_Params/
│ ├─ opt_svm_params/
│ ├─ pat_x_features_SYNCH/
├─ fig_gen.py
├─ sync_mat_gen.py
├─ SeizDetection_FR.py
├─ SeizDetection_SYNCH.py
├─ support.py
├─ run.sh
├─ requirements.txt
where x represents the patient ID, the seizure ID, and the channel number, respectively.
requirements.txt: This file lists the requirements for the execution of the Python code.
fig_gen.py: This file plots the analog signals and the associated AFE and NLNG event streams. The code is executed with `python fig_gen.py 1 1 13`, which plots patient 1, seizure 1, and channel 13 of the recording.
sync_mat_gen.py: This file describes the function for plotting the synchronization matrices emerging from the ADM and the NLNG spikes with either a linear or a log colorbar. The code is executed with `python sync_mat_gen.py 1 1` or `python sync_mat_gen.py 1 1 log`. This execution generates four figures for the pre-seizure, first half of seizure, second half of seizure, and post-seizure time periods of patient 1, seizure 1. The third option can either be left blank or given as `lin` or `log`, for the respective color bar scales. The time is the signal-time as mentioned in the table below.
run.sh: A simple Linux script to run the above code for all patients and seizures.
SeizDetection_FR.py: This file runs the LSVM on the ADM and NLNG spikes, using the firing rate (FR) as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/opt_svm_params/ folder). Users can also use the code to train the LSVM with different parameters.
SeizDetection_SYNCH.py: This file runs the LSVM on the kernelized ADM and NLNG spikes, using the flattened SYNCH matrices as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/pat_x_features_SYNCH/ folder). Users can also use the code to train the LSVM with different parameters.
LSVM_Params: Folder containing LSVM features with different parameter combinations.
support.py: This file contains the necessary functions.
data/P1S1/: This folder, for example, contains the event streams for all channels for seizure 1 of patient 1.
Pat1_Sz_1_CH1.csv: This file contains the spikes of the AFE and the NLNG layers with the following tabular format (which can be extracted by the fig_gen.py)
Column | Description
---|---
SYS_time | The time from the interface FPGA
signal_time | The time of the signal as per the SWEC-ETHZ Database
dac_value | The value of the analog signal as recorded in the SWEC-ETHZ Database
ADMspikes | The event stream output of the AFE in boolean format. True represents a spike
NLNGspikes | The spike stream output of the SNN in boolean format. True represents a spike
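As a reading sketch (column names follow the table above; the exact on-disk formatting, e.g. delimiter and header row, is an assumption):

import pandas as pd

df = pd.read_csv("data/P1S1/Pat1_Sz_1_CH1.csv")
# spike times of the AFE (ADM) and NLNG layers; comparing against the string "True"
# keeps this robust to the booleans being stored either as bool or as text
adm_spike_times = df["signal_time"][df["ADMspikes"].astype(str) == "True"]
nlng_spike_times = df["signal_time"][df["NLNGspikes"].astype(str) == "True"]
print(len(adm_spike_times), len(nlng_spike_times))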
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Bayesian network modeling (BN modeling, or BNM) is an interpretable machine learning method for constructing probabilistic graphical models from the data. In recent years, it has been extensively applied to diverse types of biomedical data sets. Concurrently, our ability to perform long-time scale molecular dynamics (MD) simulations on proteins and other materials has increased exponentially. However, the analysis of MD simulation trajectories has not been data-driven but rather dependent on the user’s prior knowledge of the systems, thus limiting the scope and utility of the MD simulations. Recently, we pioneered using BNM for analyzing the MD trajectories of protein complexes. The resulting BN models yield novel fully data-driven insights into the functional importance of the amino acid residues that modulate proteins’ function. In this report, we describe the BaNDyT software package that implements the BNM specifically attuned to the MD simulation trajectories data. We believe that BaNDyT is the first software package to include specialized and advanced features for analyzing MD simulation trajectories using a probabilistic graphical network model. We describe here the software’s uses, the methods associated with it, and a comprehensive Python interface to the underlying generalist BNM code. This provides a powerful and versatile mechanism for users to control the workflow. As an application example, we have utilized this methodology and associated software to study how membrane proteins, specifically the G protein-coupled receptors, selectively couple to G proteins. The software can be used for analyzing MD trajectories of any protein as well as polymeric materials.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📦 Software Defects Multilingual Dataset with AST & Token Features
This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.
🙋 Citation
If you use this dataset in your research or project, please cite it as:
"Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."
🧠 Dataset Highlights
defect (1 = buggy, 0 = clean)
Features:
token_count: Total tokens (AST-based for Python)
num_ifs, num_returns, num_func_calls: Code structure features
ast_nodes: Number of nodes in the abstract syntax tree (Python only)
lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

📊 Columns Description
Column | Description |
---|---|
function_name | Unique identifier for the function |
code | The actual function source code |
language | Programming language used |
lines_of_code | Approximate number of lines in the function |
cyclomatic_complexity | Simulated measure of decision complexity |
defect | 1 = buggy, 0 = clean |
token_count | Total token count (Python uses AST tokens) |
num_ifs | Count of 'if' statements |
num_returns | Count of 'return' statements |
num_func_calls | Number of function calls |
ast_nodes | AST node count (Python only, fallback = token count) |
🛠️ Usage Examples
This dataset is suitable for:
📎 **License**
This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.
This data repository contains the data sets and Python scripts associated with the manuscript 'Machine learning isotropic g values of radical polymers'. Electron paramagnetic resonance measurements allow for obtaining experimental g values of radical polymers. Analogous to chemical shifts, g values give insight into the identity and environment of the paramagnetic center. In this work, machine learning based prediction of g values is explored as a viable alternative to computationally expensive density functional theory (DFT) methods.

Description of folder contents:

Datasets: Contains PTMA polymer structures from the TR, TE-1, and TE-2 data sets transformed using a molecular descriptor (SOAP, MBTR or DAD) and corresponding DFT-calculated g values. Filenames contain 'PTMA_X' where X denotes the number of monomers which are radicals. Structure data sets have 'structure_data' in the title, DFT-calculated g values have 'giso_DFT_data' in the title. The files are in .npy (NumPy) format.

Models: ERT models trained on SOAP, MBTR and DAD feature vectors.

Scripts: Contains scripts which can be used to predict g values from XYZ files of PTMA structures with 6 monomer units and varying radical density. The script 'prediction_functions.py' contains the functions which transform the XYZ coordinates into an appropriate feature vector which the trained model uses to predict. Descriptions of individual functions are also given as docstrings (Python documentation strings) in the code. The folder also contains additional files needed for the ERT-DAD model in .pkl format.

XYZ_files: Contains atomic coordinates of PTMA structures in XYZ format. Two subfolders, WSD and TE-2, correspond to structures present in the whole structure data set and the TE-2 test data set (see main text in the manuscript for details). Filenames in the folder 'XYZ_files/TE-2/PTMA-X/' are of the type 'chainlength_6ptma_Y'_Y''.xyz' where 'chainlength_6ptma' denotes the length of the polymer chain (6 monomers), Y' denotes the proportion of monomers which are radicals (for instance, Y' = 50 means 3 out of 6 monomers are radicals) and Y'' denotes the order of the MD time frame. Actual time frame values of Y'' in ps are given in the manuscript.

PTMA-ML.ipynb: Jupyter notebook detailing the workflow of generating the trained model. The file includes steps to load data sets, transform XYZ files using molecular descriptors, optimise hyperparameters, train the model, cross-validate using the training data set and evaluate the model.

PTMA-ML.pdf: PTMA-ML.ipynb in PDF format.

List of abbreviations:
PTMA: poly(2,2,6,6-tetramethyl-1-piperidinyloxy-4-yl methacrylate)
TR: Training data set
TE-1: Test data set 1
TE-2: Test data set 2
ERT: Extremely randomized trees
WSD: Whole structure data set
SOAP: Smooth overlap of atomic positions
MBTR: Many-body tensor representation
DAD: Distances-Angles-Dihedrals
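As a loading sketch under the naming scheme described above (the exact file names below are assumptions for illustration):

import numpy as np

# transformed PTMA structures (feature vectors) and their DFT-calculated isotropic g values
X = np.load("Datasets/PTMA_3_SOAP_structure_data.npy")   # assumed file name
y = np.load("Datasets/PTMA_3_SOAP_giso_DFT_data.npy")    # assumed file name
print(X.shape, y.shape)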
Dataset Card for "SPP_30K_verified_tasks"
Dataset Summary
This is an augmented version of the Synthetic Python Problems (SPP) Dataset. This dataset has been generated from a subset of the data that has been de-duplicated and verified using a Python interpreter (SPP_30k_verified.jsonl). The original dataset contains small Python functions that include a docstring with a small description of what the function does and some calling examples for the function. The current… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides micromagnetic simulation data collected from a series of computational experiments on the effects of polygonal system shape on the energy of different magnetic states in FeGe. The data here form the results of the study ‘Skyrmion states in thin confined polygonal nanostructures.’
The dataset is split into several directories:
Data
square-samples and triangle-samples
These directories contain final state ‘relaxed’ magnetization fields for square and triangle samples respectively. The files within are organised into directories such that a sample of side length d = 40nm and which was subjected to an applied field of 500mT is labelled d40b500. Within each directory are twelve VTK unstructured grid format files (with file extension “.vtu”). These can be viewed in a variety of programmes; as of the time of writing we recommend either ParaView or MayaVi. The twelve files correspond to twelve simulations for each sample simulated, corresponding to twelve states from which each sample was relaxed - these are described in the paper which this dataset accompanies, but we note the labels are:
‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘h’, ‘u’, ‘r1’, ‘r2’, ‘r3’, ‘h2’, ‘h3’
where:
0 - 4 are incomplete to overcomplete skyrmions,
h, h2 and h3 are helical states with different periodicities
r1-r3 are different random states
u is the uniform magnetisation
The vtu files are labelled according to parameters used in the simulation. For example, a file labelled ‘160_10_3_0_u_wd000000.vtu’ encodes that:
The simulation was of a sample with side length 160nm.
The simulation was of a sample of thickness 10nm.
The maximum length of an edge in the finite element mesh of the sample was 3nm.
The system was relaxed from the ‘u’ (uniform) initial state.
‘wd’ encodes that the simulation was performed with a full demagnetizing calculation.
square-npys and triangle-npys
These directories contain computed information about each of the final states stored in square-samples and triangle-samples. This information is stored in NumPy npz files, and can be read in Python straightforwardly using the function numpy.load. Within each npz file, there are 8 arrays, each with 12 elements. These arrays are:
‘E’ - corresponds to the total energy of the relaxed state.
‘E_exchange’ - corresponds to the Exchange energy of the relaxed state.
‘E_demag’ - corresponds to the Demagnetizing energy of the relaxed state.
‘E_dmi’ - corresponds to the Dzyaloshinskii-Moriya energy of the relaxed state.
‘E_zeeman’ - corresponds to the Zeeman energy of the relaxed state.
‘S’ - Calculated Skyrmion number of the relaxed state.
‘S_abs’ - Calculated absolute Skyrmion number - see paper for calculation details.
‘m_av’ - Computed normalised average magnetisation in x, y, and z directions for relaxed state
The twelve elements here correspond to the aforementioned twelve states relaxed from, and the ordering of the array is that of the order given above.
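A minimal sketch of reading one of these npz files (the file name is an assumption; the array names follow the list above):

import numpy as np

data = np.load("square-npys/d40b500.npz")  # assumed file name for the d = 40nm, 500mT sample
print(data.files)                          # names of the eight arrays described above
E = data["E"]                              # total energies of the twelve relaxed states
lowest = int(E.argmin())                   # index of the lowest-energy relaxed state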
square-classified and triangle-classified
These directories contain a labelled dataset which gives details about what the final state in each simulation is. The files are stored as plain text, and are labelled with the following structure (the meanings of which are defined in the paper which this dataset accompanies):
iSk - Incomplete Skyrmion
Sk, or a number n followed by Sk - n Skyrmions in the state.
He - A helical state
Target - A target state.
The files contain the names of png files which are generated from the vtu files in the format ‘d_165b_350_2.png’. This example, if found in the ‘Sk.txt’ file, means that the sample which was 165nm in side length and which was relaxed under a field of 350mT from initial state 2 was found at equilibrium in a Skyrmion state.
Figures
square-pngs and triangle-pngs
These directories contain generated pngs from the vtu files. These are included for convenience as they take several hours to generate. Each directory contains three subdirectories:
all-states
This directory contains the simulation results from all samples, in the format ‘d_165b_350_2.png’, which means that the image contained here is that of the 165nm side length sample relaxed under a 350mT field from initial state 2.
ground-state
This directory contains the images which correspond to the lowest energy state found from all of the initial states. These are labelled as ‘d_180b_50.png’, such that the image contained in this file is the lowest energy state found from all twelve simulations of the 180nm sidelength under a 50mT field.
uniform-state
This directory contains the images which correspond to the states relaxed only from the uniform state. These are labelled such that an image labelled ‘d_55b_100.png’ is the state found from relaxing a 55nm sample under a 100mT applied field.
phase-diagrams
These are the generated phase diagrams which are found in the paper.
scripts
This folder contains Python scripts which generate the png files mentioned above, and also the phase diagram figures for the paper this dataset accompanies. The scripts are labelled descriptively with what they do - for e.g. ’triangle-generate-png-all-states.py’ contains the script which loads vtu files and generates the png files. The exception here is ’render.py’ which provides functions used across multiple scripts. These scripts can be modified - for example; the function 'export_vector_field' has many options which can be adjusted to, for example, plot different components of the magnetization.
In order to run the scripts reproducibly, in the root directory we have provided a Makefile which builds each component. In order to reproduce the figures yourself, on a Linux system, ParaView must be installed. The Makefile has been tested on Ubuntu 16.04 with ParaView 5.0.1. In addition, a number of Python dependencies must also be installed. These are:
scipy >=0.19.1
numpy >= 1.11.0
matplotlib == 1.5.2
pillow>=3.1.2
We have included a requirements.txt file which specifies these dependencies; they can be installed by running 'pip install -r requirements.txt' from the directory.
Once all dependencies are installed, simply run the command ‘make’ from the shell to build the Docker image and generate the figures. Note the scripts will take a long time to run - at the time of writing the runtime will be on the order of several hours on a high-specification desktop machine. For convenience, we have therefore included the generated figures within the repository (as noted above). It should be noted that for the versions used in the paper, adjustments have been made after the generation of the figures, (for e.g. to add images of states within the metastability figure, and overlaying boundaries in the phase diagrams).
If you want to reproduce only the phase diagrams, and not the pngs, the command ‘make phase-diagrams’ will do so. This is the smallest part of the figure reproduction, and takes around 5 minutes on a high-specification desktop.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a compilation of published measurements of leaf mesophyll conductance (gm) and accompanying leaf structural, anatomical, biochemical, and physiological traits as presented in: Knauer, J., Cuntz, M., Evans, J.R., Niinemets, Ü., Tosens, T., Veromann-Jürgenson, L.-L., Werner, C. and Zaehle, S. (2022), Contrasting anatomical and biochemical controls on mesophyll conductance across plant functional types, New Phytologist, doi:10.1111/nph.18363. Note that the compilation aims to represent unstressed, young, but fully expanded and high light-adapted leaves, albeit these criteria were not always explicitly stated. The reported measurements are assumed to be independent (i.e. from different set of plants) if one or several of species, cultivar/variety/genotype, population, measurement year (if annual/deciduous), age class (if woody), or growth environment is different. Only one (aggregated) gm value per set of plants is reported in the dataset. For further details on data collection and processing see reference above.
The file gm_dataset_Knauer_et_al_2022.xlsx contains the following sheets: - data: the main dataset including all variables and associated information such as species, measurement conditions, growing conditions etc. - columns_descriptions_units: a description of all columns in sheet 'data' as well as associated units (if applicable). - references: literature references of the data. Column 'refkey' can be used for cross-referencing with column 'refkey' in sheet 'data'. - references_methods: literature references for measurement methods of gm. Column 'refkey' can be used for cross-referencing with column 'method_reference' in sheet 'data'. - references_rubisco_parameters: literature references for rubisco parameters used in the studies. Column 'refkey' can be used for cross-referencing with columns 'Rubisco_constants_Ci_reference' and 'Rubisco_constants_Cc_reference' in sheet 'data'. The xlsx file can be imported into software environments such as R or python for further analysis. To read into R (tested for R version 4.1.2), a package such as readxl needs to be installed and loaded first, after which individual tabs can be imported using the read_xlsx() function. In python, the file can be imported using the pandas.read_excel command available from the pandas package.
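For example, a minimal pandas sketch for reading the main sheet (sheet names follow the description above; reading .xlsx files requires the openpyxl engine to be installed):

import pandas as pd

# read the main data sheet and the column descriptions from the xlsx file
data = pd.read_excel("gm_dataset_Knauer_et_al_2022.xlsx", sheet_name="data")
columns_info = pd.read_excel("gm_dataset_Knauer_et_al_2022.xlsx", sheet_name="columns_descriptions_units")
print(data.shape)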
The file aggregate_by_method.R provides code to read the xlsx dataset and aggregate by measurement method as described in the reference above.
For questions or comments please contact Dr. Jürgen Knauer (J.Knauer@westernsydney.edu.au).
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The dataset contains the predictions of maximum drawdown and time to maximum drawdown at all groundwater model nodes in the Namoi subregion, constrained by the observations of groundwater level, river flux and mine water production rates. The dataset also contains the scripts required for and the results of the sensitivity analysis. The dataset contains all the scripts to generate these results from the outputs of the groundwater model (Namoi groundwater model dataset) and all the spreadsheets with the results. The methodology and results are described in Janardhanan et al. (2017)
References
Janardhanan S, Crosbie R, Pickett T, Cui T, Peeters L, Slatter E, Northey J, Merrin LE, Davies P, Miotlinski K, Schmid W and Herr A (2017) Groundwater numerical modelling for the Namoi subregion. Product 2.6.2 for the Namoi subregion from the Northern Inland Catchments Bioregional Assessment. Department of the Environment and Energy, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia., http://data.bioregionalassessments.gov.au/product/NIC/NAM/2.6.2.
The workflow that underpins this dataset is captured in 'NAM_MF_UA_workflow.png'.
Spreadsheet NAM_MF_dmax_Predictions_all.csv is sourced from dataset 'Namoi groundwater model' and contains the name, coordinates, Bore_ID in the model, layer number, the name of the objective function and the minimum, maximum, median, 5th percentile and 95th percentile of the design of experiment runs of maximum drawdown (dmax) for each groundwater model node. The individual results for each node for each run of the design of experiment is stored in spreadsheet 'NAM_MF_dmax_DoE_Predictions_all.csv' The equivalent files for time to maximum drawdown (tmax) are 'NAM_MF_tmax_Predictions_all.csv' and 'NAM_MF_tmax_DoE_Predictions_all.csv'.
These files are combined with the file 'NAM_MF_Observations_all.csv', which contains the observed values for groundwater levels, mine dewatering rates and river flux, and the files NAM_MF_dist_hobs.csv, NAM_MF_dist_rivers.csv, NAM_MF_dist_mines.csv, which contain the distances of the predictions to each mine, groundwater level observation and river, in the Python script 'NAM_MF_datawrangling.py'. This script selects only those predictions where the 95th percentile of dmax is less than 1 cm for further analysis. The subset of predictions is stored in 'NAM_MF_dmax_Predictions.csv', 'NAM_MF_tmax_Predictions.csv', 'NAM_MF_dmax_DoE_Predictions.csv', 'NAM_MF_tmax_DoE_Predictions.csv'. The output spreadsheet 'NAM_MF_Observations.csv' has the observations and the distances to the selected predictions.
As the simulated equivalents to the observations are part of the predictions dataset, these files are combined in python script NAM_MF_OFs.py to generate the objective function values for each run and each prediction. The objective function values are weighted sums of the residuals, stored in NAM_MF_DoE_hres.csv, NAM_MF_DoE_mres.csv, NAM_MF_DoE_rres.csv, according to the distance to the predictions and the results are stored in NAM_MF_DoE_OFh.csv, NAM_MF_DoE_OFm.csv, NAM_MF_DoE_OFr.csv. The threshold values for each objective function and prediction are stored in NAM_MF_OF_thresholds.csv. Python script NAM_MF_OF_wrangling.py further post-processes this information to generate the acceptance rates, saved in spreadsheet NAM_MF_dmax_Predictions_ARs.csv
Python script NAM_MF_CreatePosterior.py selects the results from the design of experiment run that satisfy the acceptance criteria. The results form the posterior predictive distributions stored in NAM_MF_dmax_Posterior.csv and NAM_MF_tmax_Posterior.csv. These are further summarised in NAM_MF_Predictions_summary.csv.
The sensitivity analysis is done with script NAM_MF_SI.py, which uses the results of the design of experiment together with the parameter values, stored in NAM_MF_DoE_Parameters.csv and their description (name, range, transform) in NAM_MF_Parameters.csv. The resulting sensitivity indices for dmax, tmax and river, head and minewater flow observations are stored in NAM_MF_SI_dmax.csv, NAM_MF_SI_tmax.csv, NAM_MF_SI_river.csv, NAM_MF_SI_mine.csv and NAM_MF_SI_head.csv. The intermediate files, ending in xxxx, are the results grouped per 100 predictions. The scripts NAM_MF_SI_collate.py and NAM_MF_SI_collate.slurm collate these.
Bioregional Assessment Programme (2017) Namoi groundwater uncertainty analysis. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb.
Derived From NSW Office of Water GW licence extract linked to spatial locations NIC v2 (28 February 2014)
Derived From Namoi hydraulic conductivity measurements
Derived From Namoi NGIS Bore analysis for 2012
Derived From Namoi groundwater model alluvium extent
Derived From Surface Geology of Australia, 1:1 000 000 scale, 2012 edition
Derived From Namoi Leapfrog geological model
Derived From Historical Mining Footprints DTIRIS NAM 20150914
Derived From Gippsland Project boundary
Derived From Bioregional Assessment areas v04
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Soil and Landscape Grid National Soil Attribute Maps - Clay 3 resolution - Release 1
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Office of Water Groundwater Licence Extract NIC- Oct 2013
Derived From Geological Provinces - Full Extent
Derived From Bioregional Assessment areas v03
Derived From BOM, Australian Average Rainfall Data from 1961 to 1990
Derived From GIS analysis of HYDMEAS - Hydstra Groundwater Measurement Update: NSW Office of Water - Nov2013
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Australian 0.05º gridded chloride deposition v2
Derived From Hydstra Groundwater Measurement Update - NSW Office of Water, Nov2013
Derived From Namoi dryland diffuse groundwater recharge
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From Namoi groundwater model
Derived From Namoi bore locations, depth to water for June 2012
Derived From NSW Office of Water Groundwater Entitlements Spatial Locations
Derived From Victoria - Seamless Geology 2014
Derived From Namoi NSW Office of Water groundwater licence BA purpose
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains the code (`aws-triggers` and `azure-trigger`), data analysis scripts (`data-analysis`), and dataset (`data`) of the TriggerBench cross-provider serverless benchmark.
It also bundles a customized extension of the `serverless-benchmarker` tool to automate and analyze serverless performance experiments.
TriggerBench
The GitHub repository joe4dev/trigger-bench contains the latest version of TriggerBench. This replication package describes the version for the paper "TriggerBench: A Performance Benchmark for Serverless Function Triggers".
TriggerBench currently supports three triggers on AWS and eight triggers on Microsoft Azure.
Dataset
The `data/aws` and `data/azure` directories contain data from benchmark executions from April 2022.
Each execution is a separate directory with a timestamp in the format `yyyy-mm-dd-HH-MM-SS` (e.g., `2022-04-15_21-58-52`) and contains the following files:
Replicate Data Analysis
Installation
1. Install [Python](https://www.python.org/downloads/) 3.10+
2. Install Python dependencies `pip install -r requirements.txt`
Create Plots
1. Running `python plots.py` generates the plots and the statistical summaries presented in the paper.
By default, the plots will be saved into a `plots` sub-directory.
An alternative output directory can be configured through the environment variable `PLOTS_PATH`.
> Hint: For interactive development, we recommend the VSCode [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) in [interactive mode](https://youtu.be/lwN4-W1WR84?t=107).
Replicate Cloud Experiments
The following experiment plan automates benchmarking experiments with different types of workloads (constant and bursty).
This generates a new dataset in the same format as described above.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalized variances calculated using the method described in the article, based on experimental data. Data is stored using Xarray, specifically in the NetCDF format, and can be easily accessed using the Xarray Python library, specifically by calling xarray.open_dataset().
The dataset is structured as follows:
- two N-dimensional DataArrays, one corresponding to calculations with time displacements (labeled as time) and one to calculations with phase displacements with the time centroid already picked (labeled as final)
- each DataArray has 5 dimensions: SNR, eps (separation), ph_disp/disp (displacement), sample/sample_time (bootstrapped sample), supersample (ensemble of bootstrapped samples)
- coordinates label the parameters along each dimension

Usage examples

Opening the dataset

import numpy as np
import xarray as xr

variances = xr.open_dataset("coherent.nc")

Obtaining parameter estimates

def get_centroid_indices(variances):
    return np.bincount(
        variances.argmin(
            dim="disp" if "disp" in variances.dims else "ph_disp"
        ).values.flatten()
    )

def get_centroid_index(variances):
    return np.argmax(get_centroid_indices(variances))

def epsilon_estimator(var):
    return 4 * np.sqrt(np.clip(var, 0, None))

time_centroid_estimates = variances["time"].idxmin(dim="disp")
phase_centroid_estimates = variances["final"].idxmin(dim="ph_disp")
epsilon_estimates = epsilon_estimator(
    variances["final"].isel(ph_disp=get_centroid_index(variances["final"]))
)

Calculating and plotting precision

def plot(estimates):
    estimator_variances = estimates.var(
        dim="sample" if "sample" in estimates.dims else "sample_time"
    )
    precision = (
        1.0
        / estimator_variances.snr
        / variances.attrs["SAMPLE_SIZE"]
        / estimator_variances
    )
    precision = precision.where(xr.apply_ufunc(np.isfinite, precision), other=0)
    mean_precision = precision.mean(dim="supersample")
    mean_precision = mean_precision.where(np.isfinite(mean_precision), 0)
    precision_error = 2 * precision.std(dim="supersample").fillna(0)
    g = mean_precision.plot.scatter(
        x="eps",
        col="snr",
        col_wrap=2,
        sharex=True,
        sharey=True,
    )
    for ax, snr in zip(g.axs.flat, variances.snr.values):
        ax.errorbar(
            precision.eps.values,
            mean_precision.sel(snr=snr),
            yerr=precision_error.sel(snr=snr),
            fmt="o",
        )

plot(time_centroid_estimates)
plot(phase_centroid_estimates)
plot(epsilon_estimates)
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model. A pre-print article associated with this dataset is available here.
Data records
The dataset is stored in Hugging Face's dataset format. For each of the four implicit solvent environments (vacuum, THF, toluene, and water), the data is divided into separate datasets containing vibrational analysis of 41,645 optimized geometries. Labels are associated with the QM9 molecule labelling system given by Ramakrishnan et al.
Please note that only molecules containing H, C, N, O were considered. This exclusion was due to the limited number of molecules containing fluorine in the QM9 dataset, which was not sufficient to build a good description of the chemical environment for fluorine atoms. Including these molecules may have reduced the overall precision of any models trained on our data.
Load the dataset
Use the following Python script to load the dataset dictionary:

from datasets import load_from_disk
dataset = load_from_disk(root_directory)
print(dataset)

Expected output:

DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
DFT Methods

All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was considered converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used in the integration. Structures were optimized in vacuum and in three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model. The Hessian matrices, vibrational frequencies, and normal modes were computed for a subset of 41,645 molecular geometries using the finite differences method.

Example model weights

An example model trained on the Hessian data is included in this dataset. Full details of the model will be provided in an upcoming publication. The model is an E(3)-equivariant graph neural network implemented with the e3x package. To load the model weights, use:

```python
import jax.numpy as jnp  # import added; the original snippet assumes JAX is available

params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
```
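The loaded object is expected to be a nested dictionary of parameter arrays (a JAX-style pytree); its exact layout depends on the e3x model architecture, which is not documented here. The loop below is only a generic inspection sketch under that assumption.

```python
# Generic inspection of the parameter tree; key names depend on the model.
for name, value in params.items():
    print(name, type(value))
```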
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information, please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open("<sample>.zarr", mode='r', path="volumes/raw")           # replace "<sample>.zarr" with the path to a downloaded sample
seg = zarr.open("<sample>.zarr", mode='r', path="volumes/gt_instances")
# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
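Because each neuron occupies its own channel of the ground-truth array (CZYX layout, as described above), the channels can be iterated to get per-neuron statistics. This is only a sketch and assumes `seg` was opened as shown above.

```python
import numpy as np

# Iterate over the per-neuron channels of the ground-truth array (CZYX layout).
for idx in range(seg.shape[0]):
    mask = np.asarray(seg[idx]) > 0          # binary mask of neuron `idx`
    print(f"neuron {idx}: {int(mask.sum())} foreground voxels")
```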
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
viewer.add_labels(
gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path/to/sample.zarr>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description

This item contains all the data, code and analysis objects used in the paper: "Real-time single-molecule 3D tracking in E. coli based on cross-entropy minimization"
Elias Amselem*, Bo Broadwater*, Tora Hävermark, Magnus Johansson & Johan Elf. Dept. of Cell and Molecular Biology, Uppsala University, Sweden. *Equal contribution
We present a 3D tracking principle that approaches the sub-ms regime. The method is based on the true excitation point spread function and cross-entropy minimization for position localization of moving fluorescent reporters. Our implementation also features a new method for microsecond 3D point spread function positioning and a new estimator for diffusion analysis of tracking data. We successfully applied these methods to track the Trigger Factor protein in living bacterial cells.
Experimental data description
The data provided in this repository was generated by the microscope described in the publication mentioned above. The underlying real-time tracking principle and methods are outlined and evaluated using the code base and data found in this repository. This includes: the trajectory reconstruction based on cross-entropy minimization, the extended covariance estimator (ECVE) for diffusion, simulations for evaluating both the trajectory reconstruction and the ECVE method, and the Trigger Factor live-cell E. coli data with analysis.
Each entry includes the raw data with analysis code. In each entry, the folder "TriggerFactor_Code\ProjectMain" contains the main analysis Python file, which includes instructions on how to run the script. The folder also contains preconfigured main files to generate the data used in the manuscript; these data are also available in the ScourceData.zip file.
To understand how to use the code and its structure, please see the entry 20220722_EXP-22-BL9428_Example_Analysis and the included README.txt file. There you will find the Python requirements (dependencies and versions) and instructions on how to run simulations.