Dataset Card for "Magicoder-Evol-Instruct-110K-python"
from datasets import load_dataset

dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train")  # Replace with your dataset and split

def contains_python(entry):
    # Returns True if any message in the conversation mentions "python"
    for c in entry["messages"]:
        if "python" in c['content'].lower():
            return True
    return False
    # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search
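A minimal sketch of how this predicate could be used to derive the Python-only subset (the `filter` call and the variable name below are illustrative assumptions, not part of the original card):

python_subset = dataset.filter(contains_python)  # keep only conversations that mention Python
print(python_subset)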
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - a yield stress [Pa]
h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
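For instance, a concrete call (the root directory and subset name here are illustrative assumptions) could look like:

from dl4to.datasets import SELTODataset

dataset_train = SELTODataset(root='SELTO', name='disc_simple', train=True)   # training split of "disc simple"
dataset_val = SELTODataset(root='SELTO', name='disc_simple', train=False)    # corresponding validation split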
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # The voxel grid shape is inferred from the largest (x, y, z) indices.
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
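A short usage sketch combining the reading and conversion steps above (the sample index `i = 0` is an illustrative assumption):

i = 0  # illustrative sample index
df = pd.read_csv(f'{root}/{i}.csv', names=columns)
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, F.shape)  # (1, *grid_shape) and (3, *grid_shape), respectively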
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Python CodeSearch Dataset (Shuu12121/python-treesitter-filtered-datasetsV2)
Dataset Description
This dataset contains Python functions paired with their documentation strings (docstrings), extracted from open-source Python repositories on GitHub. It is formatted similarly to the CodeSearchNet challenge dataset. Each entry includes:
code: The source code of a python function or method. docstring: The docstring or Javadoc associated with the function/method. func_name: The… See the full description on the dataset page: https://huggingface.co/datasets/Shuu12121/python-treesitter-filtered-datasetsV2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Site Description:
In this dataset, there are seventeen production crop fields in Bulgaria where winter rapeseed and wheat were grown, and two research fields in France where winter wheat – rapeseed – barley – sunflower and winter wheat – irrigated maize crop rotations are used. The full description of those fields is in the database "In-situ crop phenology dataset from sites in Bulgaria and France" (doi.org/10.5281/zenodo.7875440).
Methodology and Data Description:
Remote sensing data is extracted from Sentinel-2 tiles 35TNJ for Bulgarian sites and 31TCJ for French sites on the day of the overpass since September 2015 for Sentinel-2 derived vegetation indices and since October 2016 for HR-VPP products. To suppress spectral mixing effects at the parcel boundaries, as highlighted by Meier et al., 2020, the values from all datasets were subgrouped per field and then aggregated to a single median value for further analysis.
Sentinel-2 data was downloaded for all test sites from CREODIAS (https://creodias.eu/) in L2A processing level using a maximum scene-wide cloudy cover threshold of 75%. Scenes before 2017 were available in L1C processing level only. Scenes in L1C processing level were corrected for atmospheric effects after downloading using Sen2Cor (v2.9) with default settings. This was the same version used for the L2A scenes obtained directly from CREODIAS.
Next, the data was extracted from the Sentinel-2 scenes for each field parcel where only SCL classes 4 (vegetation) and 5 (bare soil) pixels were kept. We resampled the 20m band B8A to match the spatial resolution of the green and red band (10m) using nearest neighbor interpolation. The entire image processing chain was carried out using the open-source Python Earth Observation Data Analysis Library (EOdal) (Graf et al., 2022).
Apart from the widely used Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), we included two recently proposed indices that were reported to have a higher correlation with photosynthesis and drought response of vegetation: These were the Near-Infrared Reflection of Vegetation (NIRv) (Badgley et al., 2017) and Kernel NDVI (kNDVI) (Camps-Valls et al., 2021). We calculated the vegetation indices in two different ways:
First, we used B08 as near-infrared (NIR) band which comes in a native spatial resolution of 10 m. B08 (central wavelength 833 nm) has a relatively coarse spectral resolution with a bandwidth of 106 nm.
Second, we used B8A which is available at 20 m spatial resolution. B8A differs from B08 in its central wavelength (864 nm) and has a narrower bandwidth (21 nm or 22 nm in the case of Sentinel-2A and 2B, respectively) compared to B08.
The High Resolution Vegetation Phenology and Productivity (HR-VPP) dataset from the Copernicus Land Monitoring Service (CLMS) comprises three sets of 10-m Sentinel-2 products: vegetation indices, vegetation phenology and productivity parameters, and seasonal trajectories (Tian et al., 2021). Both vegetation indices, the Normalized Difference Vegetation Index (NDVI) and the Plant Phenology Index (PPI), and both plant parameters, the Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) and the Leaf Area Index (LAI), were computed for the time of the Sentinel-2 overpass by the data provider.
NDVI is computed directly from B04 and B08, and PPI is computed using the Difference Vegetation Index (DVI = B08 - B04) and its seasonal maximum value per pixel. FAPAR and LAI are retrieved from B03, B04 and B08 with a neural network trained on PROSAIL model simulations. The dataset has a quality flag product (QFLAG2), a 16-bit flag that extends the scene classification band (SCL) of the Sentinel-2 Level-2 products. A “medium” filter was used to mask out QFLAG2 values from 2 to 1022, leaving land pixels (bit 1) within or outside cloud proximity (bits 11 and 13) or cloud shadow proximity (bits 12 and 14).
The HR-VPP daily raw vegetation indices products are described in detail in the user manual (Smets et al., 2022) and the computations details of PPI are given by Jin and Eklundh (2014). Seasonal trajectories refer to the 10-daily smoothed time-series of PPI used for vegetation phenology and productivity parameters retrieval with TIMESAT (Jönsson and Eklundh 2002, 2004).
HR-VPP data was downloaded through the WEkEO Copernicus Data and Information Access Services (DIAS) system with a Python 3.8.10 harmonized data access (HDA) API 0.2.1. Zonal statistics [’min’, ’max’, ’mean’, ’median’, ’count’, ’std’, ’majority’] were computed on non-masked pixel values within field boundaries with rasterstats Python package 0.17.00.
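As a rough illustration of this zonal-statistics step, a minimal sketch using the rasterstats package might look as follows (the file names for the field boundaries and the HR-VPP raster are assumptions for illustration):

from rasterstats import zonal_stats

# compute per-field statistics of a single-date HR-VPP raster within the field boundary polygons
stats = zonal_stats(
    "field_boundaries.shp",        # assumed vector file with field parcel polygons
    "HRVPP_PPI_20200615.tif",      # assumed single-date PPI raster
    stats=["min", "max", "mean", "median", "count", "std", "majority"],
)
print(stats[0])  # dictionary of statistics for the first field parcel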
The Start of season date (SOSD), end of season date (EOSD) and length of seasons (LENGTH) were extracted from the annual Vegetation Phenology and Productivity Parameters (VPP) dataset as an additional source for comparison. These data are a product of the Vegetation Phenology and Productivity Parameters, see (https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity/vegetation-phenology-and-productivity) for detailed information.
File Description:
4 datasets:
1_senseco_data_S2_B08_Bulgaria_France; 1_senseco_data_S2_B8A_Bulgaria_France; 1_senseco_data_HR_VPP_Bulgaria_France; 1_senseco_data_phenology_VPP_Bulgaria_France
3 metadata:
2_senseco_metadata_S2_B08_B8A_Bulgaria_France; 2_senseco_metadata_HR_VPP_Bulgaria_France; 2_senseco_metadata_phenology_VPP_Bulgaria_France
The dataset files “1_senseco_data_S2_B08_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” concern all vegetation indices (EVI, NDVI, kNDVI, NIRv) data values and related information, and the metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France” describes all the existing variables. Both “1_senseco_data_S2_B08_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” have the same column variable names and for that reason they share the same metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France”.
The dataset file “1_senseco_data_HR_VPP_Bulgaria_France” concerns vegetation indices (NDVI, PPI) and plant parameters (LAI, FAPAR) data values and related information, and the metadata file “2_senseco_metadata_HR_VPP_Bulgaria_France” describes all the existing variables.
The dataset file “1_senseco_data_phenology_VPP_Bulgaria_France” concerns the vegetation phenology and productivity parameters (LENGTH, SOSD, EOSD) values and related information, and the metadata file “2_senseco_metadata_phenology_VPP_Bulgaria_France” describes all the existing variables.
Bibliography
G. Badgley, C.B. Field, J.A. Berry, Canopy near-infrared reflectance and terrestrial photosynthesis, Sci. Adv. 3 (2017) e1602244. https://doi.org/10.1126/sciadv.1602244.
G. Camps-Valls, M. Campos-Taberner, Á. Moreno-Martínez, S. Walther, G. Duveiller, A. Cescatti, M.D. Mahecha, J. Muñoz-Marí, F.J. García-Haro, L. Guanter, M. Jung, J.A. Gamon, M. Reichstein, S.W. Running, A unified vegetation index for quantifying the terrestrial biosphere, Sci. Adv. 7 (2021) eabc7447. https://doi.org/10.1126/sciadv.abc7447.
L.V. Graf, G. Perich, H. Aasen, EOdal: An open-source Python package for large-scale agroecological research using Earth Observation and gridded environmental data, Comput. Electron. Agric. 203 (2022) 107487. https://doi.org/10.1016/j.compag.2022.107487.
H. Jin, L. Eklundh, A physically based vegetation index for improved monitoring of plant phenology, Remote Sens. Environ. 152 (2014) 512–525. https://doi.org/10.1016/j.rse.2014.07.010.
P. Jonsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (2002) 1824–1832. https://doi.org/10.1109/TGRS.2002.802519.
P. Jönsson, L. Eklundh, TIMESAT—a program for analyzing time-series of satellite sensor data, Comput. Geosci. 30 (2004) 833–845. https://doi.org/10.1016/j.cageo.2004.05.006.
J. Meier, W. Mauser, T. Hank, H. Bach, Assessments on the impact of high-resolution-sensor pixel sizes for common agricultural policy and smart farming services in European regions, Comput. Electron. Agric. 169 (2020) 105205. https://doi.org/10.1016/j.compag.2019.105205.
B. Smets, Z. Cai, L. Eklund, F. Tian, K. Bonte, R. Van Hoost, R. Van De Kerchove, S. Adriaensen, B. De Roo, T. Jacobs, F. Camacho, J. Sánchez-Zapero, S. Else, H. Scheifinger, K. Hufkens, P. Jönsson, HR-VPP Product User Manual Vegetation Indices, 2022.
F. Tian, Z. Cai, H. Jin, K. Hufkens, H. Scheifinger, T. Tagesson, B. Smets, R. Van Hoolst, K. Bonte, E. Ivits, X. Tong, J. Ardö, L. Eklundh, Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe, Remote Sens. Environ. 260 (2021) 112456. https://doi.org/10.1016/j.rse.2021.112456.
This data release supports the paper titled, “Tungsten skarn potential of the Yukon-Tanana Uplands, Eastern Alaska, USA-A mineral resource assessment”, published via open-access license in the Journal of Geochemical Exploration and available at: https://doi.org/10.1016/j.gexplo.2020.106700. The data release includes GIS data that map potential for tungsten skarn mineralization in permissive tracts in the Yukon-Tanana Uplands, Eastern Alaska, along with tables listing keywords and procedures used to produce the permissive tracts and score them for mineral potential. Supplementary Data part A lists keywords used to extract permissive rock types from the Geologic Map of Alaska (Wilson et al., 2015) to generate the permissive tract for tungsten skarn. Supplementary Data part B describes the tract polishing procedures. Supplementary Data part C lists the parameters for scoring tungsten skarn mineralization potential within the permissive tract features. The GIS data are encapsulated in a file geodatabase called AK_Wskarn_tract.gdb and are also available in the shapefile and KML formats. The geodatabase contains three datasets. The polygon feature class “primary_attributes” contains the scored tungsten skarn permissive tract subdivided by National Hydrography Dataset HUC12 drainages. A related table, “qualitative_assessment” contains detailed scoring information for each feature. The point feature class “mineral_sites_ranked” contains W-bearing mineral sites pulled from the Alaska Resource Data File with additional fields added for this study. The GIS data folder also includes the Python script used to score potential. The datasets and methods are described in detail in the accompanying paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is split into two parts according to the two experiments described within the article. The dataset includes movies and python codes for classifying emotions from experiment 1, and EEG and ERP measurements from experiment 2 along with associated code for analyzing those data.
Experiment 1 tests the validity of the SEED dataset collated by Zheng, Dong, & Lu (2014) and, subsequently, our own stimuli. The objective was to test whether previous literature using datasets such as the aforementioned one by Zheng et al. is classifying between emotions based on emotion-related signals of interest or on non-emotional ‘noise’.
Experiment 2 used stimuli that have been well-validated within the psychological literature as reliably evoking specific embodiments of emotions within the viewer, namely the NimStim face and ADFES-BIV datasets, with the objective of classifying a person's emotional status using EEG.
All data was processed and all analyses were run in MATLAB or Python. All datasets used are included within the folders, accompanied by the MATLAB or Python scripts for collating separable matrices and running the analyses.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset for event encoded analog EEG signals for detection of Epileptic seizures
This dataset contains events that are encoded from the analog signals recorded during pre-surgical evaluations of patients at the Sleep-Wake-Epilepsy-Center (SWEC) of the University Department of Neurology at the Inselspital Bern. The analog signals are sourced from the SWEC-ETHZ iEEG Database
This database contains event streams for 10 seizures recorded from 5 patients, generated by the DYnamic Neuromorphic Asynchronous Processor (DYNAP-SE2) to demonstrate a proof-of-concept of encoding seizures with network synchronization. The pipeline consists of two parts: (I) an Analog Front End (AFE) and (II) a spiking neural network (SNN) termed the "Non-Local Non-Global" (NLNG) network.
In the first part of the pipeline, the digitally recorded signals from SWEC-ETHZ iEEG Database are converted to analog signals via an 18-bit Digital-to-Analog converter (DAC) and then amplified and encoded into events by an Asynchronous Delta Modulator (ADM). Then in the second part, the encoded event streams are fed into the SNN that extracts the features of the epileptic seizure by extracting the partial synchronous patterns intrinsic to the seizure dynamics.
Details about the neuromorphic processing pipeline and the encoding process are included in a manuscript under review. The preprint is available in bioRxiv
Installation
The installation requires Python >= 3.x and conda (or py-venv). Users can then install the requirements inside a conda environment using
conda env create -f requirements.txt -n sez
Once created, the conda environment can be activated with `conda activate sez`.
The main files in the database are described in the hierarchy below.
EventSezDataset/
├─ data/
│  ├─ PxSx/
│  │  ├─ Patx_Sz_x_CHx.csv
├─ LSVM_Params/
│ ├─ opt_svm_params/
│ ├─ pat_x_features_SYNCH/
├─ fig_gen.py
├─ sync_mat_gen.py
├─ SeizDetection_FR.py
├─ SeizDetection_SYNCH.py
├─ support.py
├─ run.sh
├─ requirements.txt
where x represents the patient ID, the seizure ID, and the channel number, respectively.
requirements.txt: This file lists the requirements for the execution of the Python code.
fig_gen.py: This file plots the analog signals and the associated AFE and NLNG event streams. The code is executed with `python fig_gen.py 1 1 13`, which plots patient 1, seizure 1, and channel 13 of the recording.
sync_mat_gen.py: This file describes the function for plotting the synchronization matrices emerging from the ADM and the NLNG spikes with either a linear or a log colorbar. The code is executed with `python sync_mat_gen.py 1 1` or `python sync_mat_gen.py 1 1 log`. This execution generates four figures for the pre-seizure, first half of seizure, second half of seizure, and post-seizure time periods of patient 1, seizure 1. The third option can either be left blank or given as `lin` or `log`, for the respective color bar scales. The time is the signal-time as mentioned in the table below.
run.sh: A simple Linux script to run the above code for all patients and seizures.
SeizDetection_FR.py: This file runs the LSVM on the ADM and NLNG spikes, using the firing rate (FR) as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/opt_svm_params/ folder). Users can also use the code to train the LSVM with different parameters.
SeizDetection_SYNCH.py: This file runs the LSVM on the kernelized ADM and NLNG spikes, using the flattened SYNCH matrices as a feature. The code is currently set up to plot with pre-computed features (in the LSVM_Params/pat_x_features_SYNCH/ folder). Users can also use the code to train the LSVM with different parameters.
LSVM_Params: Folder containing LSVM features with different parameter combinations.
support.py: This file contains the necessary functions.
data/P1S1/: This folder, for example, contains the event streams for all channels for seizure 1 of patient 1.
Pat1_Sz_1_CH1.csv: This file contains the spikes of the AFE and the NLNG layers with the following tabular format (which can be extracted by the fig_gen.py)
Column | Description
---|---
SYS_time | The time from the interface FPGA
signal_time | The time of the signal as per the SWEC-ETHZ Database
dac_value | The value of the analog signal as recorded in the SWEC-ETHZ Database
ADMspikes | The event stream output of the AFE in boolean format. True represents a spike
NLNGspikes | The spike stream output of the SNN in boolean format. True represents a spike
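As a reading sketch (column names follow the table above; the exact on-disk formatting, e.g. delimiter and header row, is an assumption):

import pandas as pd

df = pd.read_csv("data/P1S1/Pat1_Sz_1_CH1.csv")
# spike times of the AFE (ADM) and NLNG layers; comparing against the string "True"
# keeps this robust to the booleans being stored either as bool or as text
adm_spike_times = df["signal_time"][df["ADMspikes"].astype(str) == "True"]
nlng_spike_times = df["signal_time"][df["NLNGspikes"].astype(str) == "True"]
print(len(adm_spike_times), len(nlng_spike_times))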
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Bayesian network modeling (BN modeling, or BNM) is an interpretable machine learning method for constructing probabilistic graphical models from the data. In recent years, it has been extensively applied to diverse types of biomedical data sets. Concurrently, our ability to perform long-time scale molecular dynamics (MD) simulations on proteins and other materials has increased exponentially. However, the analysis of MD simulation trajectories has not been data-driven but rather dependent on the user’s prior knowledge of the systems, thus limiting the scope and utility of the MD simulations. Recently, we pioneered using BNM for analyzing the MD trajectories of protein complexes. The resulting BN models yield novel fully data-driven insights into the functional importance of the amino acid residues that modulate proteins’ function. In this report, we describe the BaNDyT software package that implements the BNM specifically attuned to the MD simulation trajectories data. We believe that BaNDyT is the first software package to include specialized and advanced features for analyzing MD simulation trajectories using a probabilistic graphical network model. We describe here the software’s uses, the methods associated with it, and a comprehensive Python interface to the underlying generalist BNM code. This provides a powerful and versatile mechanism for users to control the workflow. As an application example, we have utilized this methodology and associated software to study how membrane proteins, specifically the G protein-coupled receptors, selectively couple to G proteins. The software can be used for analyzing MD trajectories of any protein as well as polymeric materials.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📦 Software Defects Multilingual Dataset with AST & Token Features
This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.
🙋 Citation
If you use this dataset in your research or project, please cite it as:
"Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."
🧠 Dataset Highlights
defect (1 = buggy, 0 = clean)
Features:
token_count: Total tokens (AST-based for Python)
num_ifs, num_returns, num_func_calls: Code structure features
ast_nodes: Number of nodes in the abstract syntax tree (Python only)
lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

📊 Columns Description
Column | Description |
---|---|
function_name | Unique identifier for the function |
code | The actual function source code |
language | Programming language used |
lines_of_code | Approximate number of lines in the function |
cyclomatic_complexity | Simulated measure of decision complexity |
defect | 1 = buggy, 0 = clean |
token_count | Total token count (Python uses AST tokens) |
num_ifs | Count of 'if' statements |
num_returns | Count of 'return' statements |
num_func_calls | Number of function calls |
ast_nodes | AST node count (Python only, fallback = token count) |
🛠️ Usage Examples
This dataset is suitable for:
📎 **License**
This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.
This data repository contains the data sets and Python scripts associated with the manuscript 'Machine learning isotropic g values of radical polymers'. Electron paramagnetic resonance measurements allow for obtaining experimental g values of radical polymers. Analogous to chemical shifts, g values give insight into the identity and environment of the paramagnetic center. In this work, machine learning based prediction of g values is explored as a viable alternative to computationally expensive density functional theory (DFT) methods.

Description of folder contents:

Datasets: Contains PTMA polymer structures from the TR, TE-1, and TE-2 data sets transformed using a molecular descriptor (SOAP, MBTR or DAD) and corresponding DFT-calculated g values. Filenames contain 'PTMA_X' where X denotes the number of monomers which are radicals. Structure data sets have 'structure_data' in the title, DFT-calculated g values have 'giso_DFT_data' in the title. The files are in .npy (NumPy) format.

Models: ERT models trained on SOAP, MBTR and DAD feature vectors.

Scripts: Contains scripts which can be used to predict g values from XYZ files of PTMA structures with 6 monomer units and varying radical density. The script 'prediction_functions.py' contains the functions which transform the XYZ coordinates into an appropriate feature vector which the trained model uses to predict. Descriptions of individual functions are also given as docstrings (Python documentation strings) in the code. The folder also contains additional files needed for the ERT-DAD model in .pkl format.

XYZ_files: Contains atomic coordinates of PTMA structures in XYZ format. Two subfolders, WSD and TE-2, correspond to structures present in the whole structure data set and the TE-2 test data set (see main text in the manuscript for details). Filenames in the folder 'XYZ_files/TE-2/PTMA-X/' are of the type 'chainlength_6ptma_Y'_Y''.xyz' where 'chainlength_6ptma' denotes the length of the polymer chain (6 monomers), Y' denotes the proportion of monomers which are radicals (for instance, Y' = 50 means 3 out of 6 monomers are radicals) and Y'' denotes the order of the MD time frame. Actual time frame values of Y'' in ps are given in the manuscript.

PTMA-ML.ipynb: Jupyter notebook detailing the workflow of generating the trained model. The file includes steps to load data sets, transform XYZ files using molecular descriptors, optimise hyperparameters, train the model, cross-validate using the training data set and evaluate the model.

PTMA-ML.pdf: PTMA-ML.ipynb in PDF format.

List of abbreviations:
PTMA: poly(2,2,6,6-tetramethyl-1-piperidinyloxy-4-yl methacrylate)
TR: Training data set
TE-1: Test data set 1
TE-2: Test data set 2
ERT: Extremely randomized trees
WSD: Whole structure data set
SOAP: Smooth overlap of atomic positions
MBTR: Many-body tensor representation
DAD: Distances-Angles-Dihedrals
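As a loading sketch under the naming scheme described above (the exact file names below are assumptions for illustration):

import numpy as np

# transformed PTMA structures (feature vectors) and their DFT-calculated isotropic g values
X = np.load("Datasets/PTMA_3_SOAP_structure_data.npy")   # assumed file name
y = np.load("Datasets/PTMA_3_SOAP_giso_DFT_data.npy")    # assumed file name
print(X.shape, y.shape)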
Dataset Card for "SPP_30K_verified_tasks"
Dataset Summary
This is an augmented version of the Synthetic Python Problems (SPP) Dataset. This dataset has been generated from a subset of the data that has been de-duplicated and verified using a Python interpreter (SPP_30k_verified.jsonl). The original dataset contains small Python functions that include a docstring with a small description of what the function does and some calling examples for the function. The current… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides micromagnetic simulation data collected from a series of computational experiments on the effects of polygonal system shape on the energy of different magnetic states in FeGe. The data here form the results of the study ‘Skyrmion states in thin confined polygonal nanostructures.’
The dataset is split into several directories:
Data
square-samples and triangle-samples
These directories contain final state ‘relaxed’ magnetization fields for square and triangle samples respectively. The files within are organised into directories such that a sample of side length d = 40nm and which was subjected to an applied field of 500mT is labelled d40b500. Within each directory are twelve VTK unstructured grid format files (with file extension “.vtu”). These can be viewed in a variety of programmes; as of the time of writing we recommend either ParaView or MayaVi. The twelve files correspond to twelve simulations for each sample simulated, corresponding to twelve states from which each sample was relaxed - these are described in the paper which this dataset accompanies, but we note the labels are:
‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘h’, ‘u’, ‘r1’, ‘r2’, ‘r3’, ‘h2’, ‘h3’
where:
0 - 4 are incomplete to overcomplete skyrmions,
h, h2 and h3 are helical states with different periodicities
r1-r3 are different random states
u is the uniform magnetisation
The vtu files are labelled according to parameters used in the simulation. For example, a file labelled ‘160_10_3_0_u_wd000000.vtu’ encodes that:
The simulation was of a sample with side length 160nm.
The simulation was of a sample of thickness 10nm.
The maximum length of an edge in the finite element mesh of the sample was 3nm.
The system was relaxed from the ‘u’ (uniform) initial state.
‘wd’ encodes that the simulation was performed with a full demagnetizing calculation.
square-npys and triangle-npys
These directories contain computed information about each of the final states stored in square-samples and triangle-samples. This information is stored in NumPy npz files, and can be read in Python straightforwardly using the function numpy.load. Within each npz file, there are 8 arrays, each with 12 elements. These arrays are:
‘E’ - corresponds to the total energy of the relaxed state.
‘E_exchange’ - corresponds to the Exchange energy of the relaxed state.
‘E_demag’ - corresponds to the Demagnetizing energy of the relaxed state.
‘E_dmi’ - corresponds to the Dzyaloshinskii-Moriya energy of the relaxed state.
‘E_zeeman’ - corresponds to the Zeeman energy of the relaxed state.
‘S’ - Calculated Skyrmion number of the relaxed state.
‘S_abs’ - Calculated absolute Skyrmion number - see paper for calculation details.
‘m_av’ - Computed normalised average magnetisation in x, y, and z directions for relaxed state
The twelve elements here correspond to the aforementioned twelve states relaxed from, and the ordering of the array is that of the order given above.
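A minimal sketch of reading one of these npz files (the file name is an assumption; the array names follow the list above):

import numpy as np

data = np.load("square-npys/d40b500.npz")  # assumed file name for the d = 40nm, 500mT sample
print(data.files)                          # names of the eight arrays described above
E = data["E"]                              # total energies of the twelve relaxed states
lowest = int(E.argmin())                   # index of the lowest-energy relaxed state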
square-classified and triangle-classified
These directories contain a labelled dataset which gives details about what the final state in each simulation is. The files are stored as plain text, and are labelled with the following structure (the meanings of which are defined in the paper which this dataset accompanies):
iSk - Incomplete Skyrmion
Sk, or a number n followed by Sk - n Skyrmions in the state.
He - A helical state
Target - A target state.
The files contain the names of png files which are generated from the vtu files in the format ‘d_165b_350_2.png’. This example, if found in the ‘Sk.txt’ file, means that the sample which was 165nm in side length and which was relaxed under a field of 350mT from initial state 2 was found at equilibrium in a Skyrmion state.
Figures
square-pngs and triangle-pngs
These directories contain generated pngs from the vtu files. These are included for convenience as they take several hours to generate. Each directory contains three subdirectories:
all-states
This directory contains the simulation results from all samples, in the format ‘d_165b_350_2.png’, which means that the image contained here is that of the 165nm side length sample relaxed under a 350mT field from initial state 2.
ground-state
This directory contains the images which correspond to the lowest energy state found from all of the initial states. These are labelled as ‘d_180b_50.png’, such that the image contained in this file is the lowest energy state found from all twelve simulations of the 180nm sidelength under a 50mT field.
uniform-state
This directory contains the images which correspond to the states relaxed only from the uniform state. These are labelled such that an image labelled ‘d_55b_100.png’ is the state found from relaxing a 55nm sample under a 100mT applied field.
phase-diagrams
These are the generated phase diagrams which are found in the paper.
scripts
This folder contains Python scripts which generate the png files mentioned above, and also the phase diagram figures for the paper this dataset accompanies. The scripts are labelled descriptively with what they do - for e.g. ’triangle-generate-png-all-states.py’ contains the script which loads vtu files and generates the png files. The exception here is ’render.py’ which provides functions used across multiple scripts. These scripts can be modified - for example; the function 'export_vector_field' has many options which can be adjusted to, for example, plot different components of the magnetization.
In order to run the scripts reproducibly, in the root directory we have provided a Makefile which builds each component. In order to reproduce the figures yourself, on a Linux system, ParaView must be installed. The Makefile has been tested on Ubuntu 16.04 with ParaView 5.0.1. In addition, a number of Python dependencies must also be installed. These are:
scipy >=0.19.1
numpy >= 1.11.0
matplotlib == 1.5.2
pillow>=3.1.2
We have included a requirements.txt file which specifies these dependencies; they can be installed by running 'pip install -r requirements.txt' from the directory.
Once all dependencies are installed, simply run the command ‘make’ from the shell to build the Docker image and generate the figures. Note the scripts will take a long time to run - at the time of writing the runtime will be on the order of several hours on a high-specification desktop machine. For convenience, we have therefore included the generated figures within the repository (as noted above). It should be noted that for the versions used in the paper, adjustments have been made after the generation of the figures, (for e.g. to add images of states within the metastability figure, and overlaying boundaries in the phase diagrams).
If you want to reproduce only the phase diagrams, and not the pngs, the command ‘make phase-diagrams’ will do so. This is the smallest part of the figure reproduction, and takes around 5 minutes on a high-specification desktop.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a compilation of published measurements of leaf mesophyll conductance (gm) and accompanying leaf structural, anatomical, biochemical, and physiological traits as presented in: Knauer, J., Cuntz, M., Evans, J.R., Niinemets, Ü., Tosens, T., Veromann-Jürgenson, L.-L., Werner, C. and Zaehle, S. (2022), Contrasting anatomical and biochemical controls on mesophyll conductance across plant functional types, New Phytologist, doi:10.1111/nph.18363. Note that the compilation aims to represent unstressed, young, but fully expanded and high light-adapted leaves, albeit these criteria were not always explicitly stated. The reported measurements are assumed to be independent (i.e. from different set of plants) if one or several of species, cultivar/variety/genotype, population, measurement year (if annual/deciduous), age class (if woody), or growth environment is different. Only one (aggregated) gm value per set of plants is reported in the dataset. For further details on data collection and processing see reference above.
The file gm_dataset_Knauer_et_al_2022.xlsx contains the following sheets: - data: the main dataset including all variables and associated information such as species, measurement conditions, growing conditions etc. - columns_descriptions_units: a description of all columns in sheet 'data' as well as associated units (if applicable). - references: literature references of the data. Column 'refkey' can be used for cross-referencing with column 'refkey' in sheet 'data'. - references_methods: literature references for measurement methods of gm. Column 'refkey' can be used for cross-referencing with column 'method_reference' in sheet 'data'. - references_rubisco_parameters: literature references for rubisco parameters used in the studies. Column 'refkey' can be used for cross-referencing with columns 'Rubisco_constants_Ci_reference' and 'Rubisco_constants_Cc_reference' in sheet 'data'. The xlsx file can be imported into software environments such as R or python for further analysis. To read into R (tested for R version 4.1.2), a package such as readxl needs to be installed and loaded first, after which individual tabs can be imported using the read_xlsx() function. In python, the file can be imported using the pandas.read_excel command available from the pandas package.
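For example, a minimal pandas sketch for reading the main sheet (sheet names follow the description above; reading .xlsx files requires the openpyxl engine to be installed):

import pandas as pd

# read the main data sheet and the column descriptions from the xlsx file
data = pd.read_excel("gm_dataset_Knauer_et_al_2022.xlsx", sheet_name="data")
columns_info = pd.read_excel("gm_dataset_Knauer_et_al_2022.xlsx", sheet_name="columns_descriptions_units")
print(data.shape)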
The file aggregate_by_method.R provides code to read the xlsx dataset and aggregate by measurement method as described in the reference above.
For questions or comments please contact Dr. Jürgen Knauer (J.Knauer@westernsydney.edu.au).
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The dataset contains the predictions of maximum drawdown and time to maximum drawdown at all groundwater model nodes in the Namoi subregion, constrained by the observations of groundwater level, river flux and mine water production rates. The dataset also contains the scripts required for and the results of the sensitivity analysis. The dataset contains all the scripts to generate these results from the outputs of the groundwater model (Namoi groundwater model dataset) and all the spreadsheets with the results. The methodology and results are described in Janardhanan et al. (2017)
References
Janardhanan S, Crosbie R, Pickett T, Cui T, Peeters L, Slatter E, Northey J, Merrin LE, Davies P, Miotlinski K, Schmid W and Herr A (2017) Groundwater numerical modelling for the Namoi subregion. Product 2.6.2 for the Namoi subregion from the Northern Inland Catchments Bioregional Assessment. Department of the Environment and Energy, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia., http://data.bioregionalassessments.gov.au/product/NIC/NAM/2.6.2.
The workflow that underpins this dataset is captured in 'NAM_MF_UA_workflow.png'.
Spreadsheet NAM_MF_dmax_Predictions_all.csv is sourced from dataset 'Namoi groundwater model' and contains the name, coordinates, Bore_ID in the model, layer number, the name of the objective function and the minimum, maximum, median, 5th percentile and 95th percentile of the design of experiment runs of maximum drawdown (dmax) for each groundwater model node. The individual results for each node for each run of the design of experiment is stored in spreadsheet 'NAM_MF_dmax_DoE_Predictions_all.csv' The equivalent files for time to maximum drawdown (tmax) are 'NAM_MF_tmax_Predictions_all.csv' and 'NAM_MF_tmax_DoE_Predictions_all.csv'.
These files are combined with the file 'NAM_MF_Observations_all.csv', which contains the observed values for groundwater levels, mine dewatering rates and river flux, and the files NAM_MF_dist_hobs.csv, NAM_MF_dist_rivers.csv, NAM_MF_dist_mines.csv, which contain the distances of the predictions to each mine, groundwater level observation and river, in the Python script 'NAM_MF_datawrangling.py'. This script selects only those predictions where the 95th percentile of dmax is less than 1 cm for further analysis. The subset of predictions is stored in 'NAM_MF_dmax_Predictions.csv', 'NAM_MF_tmax_Predictions.csv', 'NAM_MF_dmax_DoE_Predictions.csv', 'NAM_MF_tmax_DoE_Predictions.csv'. The output spreadsheet 'NAM_MF_Observations.csv' has the observations and the distances to the selected predictions.
As the simulated equivalents to the observations are part of the predictions dataset, these files are combined in python script NAM_MF_OFs.py to generate the objective function values for each run and each prediction. The objective function values are weighted sums of the residuals, stored in NAM_MF_DoE_hres.csv, NAM_MF_DoE_mres.csv, NAM_MF_DoE_rres.csv, according to the distance to the predictions and the results are stored in NAM_MF_DoE_OFh.csv, NAM_MF_DoE_OFm.csv, NAM_MF_DoE_OFr.csv. The threshold values for each objective function and prediction are stored in NAM_MF_OF_thresholds.csv. Python script NAM_MF_OF_wrangling.py further post-processes this information to generate the acceptance rates, saved in spreadsheet NAM_MF_dmax_Predictions_ARs.csv
Python script NAM_MF_CreatePosterior.py selects the results from the design of experiment run that satisfy the acceptance criteria. The results form the posterior predictive distributions stored in NAM_MF_dmax_Posterior.csv and NAM_MF_tmax_Posterior.csv. These are further summarised in NAM_MF_Predictions_summary.csv.
The sensitivity analysis is done with script NAM_MF_SI.py, which uses the results of the design of experiment together with the parameter values, stored in NAM_MF_DoE_Parameters.csv and their description (name, range, transform) in NAM_MF_Parameters.csv. The resulting sensitivity indices for dmax, tmax and river, head and minewater flow observations are stored in NAM_MF_SI_dmax.csv, NAM_MF_SI_tmax.csv, NAM_MF_SI_river.csv, NAM_MF_SI_mine.csv and NAM_MF_SI_head.csv. The intermediate files, ending in xxxx, are the results grouped per 100 predictions. The scripts NAM_MF_SI_collate.py and NAM_MF_SI_collate.slurm collate these.
Bioregional Assessment Programme (2017) Namoi groundwater uncertainty analysis. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/36bd27e9-58d2-4bf2-8e4a-54b22ac98cfb.
Derived From NSW Office of Water GW licence extract linked to spatial locations NIC v2 (28 February 2014)
Derived From Namoi hydraulic conductivity measurements
Derived From Namoi NGIS Bore analysis for 2012
Derived From Namoi groundwater model alluvium extent
Derived From Surface Geology of Australia, 1:1 000 000 scale, 2012 edition
Derived From Namoi Leapfrog geological model
Derived From Historical Mining Footprints DTIRIS NAM 20150914
Derived From Gippsland Project boundary
Derived From Bioregional Assessment areas v04
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Soil and Landscape Grid National Soil Attribute Maps - Clay 3 resolution - Release 1
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Office of Water Groundwater Licence Extract NIC- Oct 2013
Derived From Geological Provinces - Full Extent
Derived From Bioregional Assessment areas v03
Derived From BOM, Australian Average Rainfall Data from 1961 to 1990
Derived From GIS analysis of HYDMEAS - Hydstra Groundwater Measurement Update: NSW Office of Water - Nov2013
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Australian 0.05º gridded chloride deposition v2
Derived From Hydstra Groundwater Measurement Update - NSW Office of Water, Nov2013
Derived From Namoi dryland diffuse groundwater recharge
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From Namoi groundwater model
Derived From Namoi bore locations, depth to water for June 2012
Derived From NSW Office of Water Groundwater Entitlements Spatial Locations
Derived From Victoria - Seamless Geology 2014
Derived From Namoi NSW Office of Water groundwater licence BA purpose
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains the code (`aws-triggers` and `azure-trigger`), data analysis scripts (`data-analysis`), and dataset (`data`) of the TriggerBench cross-provider serverless benchmark.
It also bundles a customized extension of the `serverless-benchmarker` tool to automate and analyze serverless performance experiments.
TriggerBench
The GitHub repository joe4dev/trigger-bench contains the latest version of TriggerBench. This replication package describes the version for the paper "TriggerBench: A Performance Benchmark for Serverless Function Triggers".
TriggerBench currently supports three triggers on AWS and eight triggers on Microsoft Azure.
Dataset
The `data/aws` and `data/azure` directories contain data from benchmark executions from April 2022.
Each execution is a separate directory with a timestamp in the format `yyyy-mm-dd-HH-MM-SS` (e.g., `2022-04-15_21-58-52`) and contains the following files:
Replicate Data Analysis
Installation
1. Install [Python](https://www.python.org/downloads/) 3.10+
2. Install Python dependencies `pip install -r requirements.txt`
Create Plots
1. Running `python plots.py` generates the plots and the statistical summaries presented in the paper.
By default, the plots will be saved into a `plots` sub-directory.
An alternative output directory can be configured through the environment variable `PLOTS_PATH`.
> Hint: For interactive development, we recommend the VSCode [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) in [interactive mode](https://youtu.be/lwN4-W1WR84?t=107).
Replicate Cloud Experiments
The following experiment plan automates benchmarking experiments with different types of workloads (constant and bursty).
This generates a new dataset in the same format as described above.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalized variances calculated using the method described in the article, based on experimental data. Data is stored using Xarray, specifically in the NetCDF format, and can be easily accessed using the Xarray Python library, specifically by calling xarray.open_dataset().
The dataset is structured as follows:
- two N-dimensional DataArrays, one corresponding to calculations with time displacements (labeled as time) and one to calculations with phase displacements with the time centroid already picked (labeled as final)
- each DataArray has 5 dimensions: SNR, eps (separation), ph_disp/disp (displacement), sample/sample_time (bootstrapped sample), supersample (ensemble of bootstrapped samples)
- coordinates label the parameters along each dimension

Usage examples

Opening the dataset

import numpy as np
import xarray as xr

variances = xr.open_dataset("coherent.nc")

Obtaining parameter estimates

def get_centroid_indices(variances):
    return np.bincount(
        variances.argmin(
            dim="disp" if "disp" in variances.dims else "ph_disp"
        ).values.flatten()
    )

def get_centroid_index(variances):
    return np.argmax(get_centroid_indices(variances))

def epsilon_estimator(var):
    return 4 * np.sqrt(np.clip(var, 0, None))

time_centroid_estimates = variances["time"].idxmin(dim="disp")
phase_centroid_estimates = variances["final"].idxmin(dim="ph_disp")
epsilon_estimates = epsilon_estimator(
    variances["final"].isel(ph_disp=get_centroid_index(variances["final"]))
)

Calculating and plotting precision

def plot(estimates):
    estimator_variances = estimates.var(
        dim="sample" if "sample" in estimates.dims else "sample_time"
    )
    precision = (
        1.0
        / estimator_variances.snr
        / variances.attrs["SAMPLE_SIZE"]
        / estimator_variances
    )
    precision = precision.where(xr.apply_ufunc(np.isfinite, precision), other=0)
    mean_precision = precision.mean(dim="supersample")
    mean_precision = mean_precision.where(np.isfinite(mean_precision), 0)
    precision_error = 2 * precision.std(dim="supersample").fillna(0)
    g = mean_precision.plot.scatter(
        x="eps",
        col="snr",
        col_wrap=2,
        sharex=True,
        sharey=True,
    )
    for ax, snr in zip(g.axs.flat, variances.snr.values):
        ax.errorbar(
            precision.eps.values,
            mean_precision.sel(snr=snr),
            yerr=precision_error.sel(snr=snr),
            fmt="o",
        )

plot(time_centroid_estimates)
plot(phase_centroid_estimates)
plot(epsilon_estimates)
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model. A pre-print article associated with this dataset is available here.
Data records
The dataset is stored in Hugging Face's dataset format. For each of the four implicit solvent environments (vacuum, THF, toluene, and water), the data is divided into separate datasets containing vibrational analysis of 41,645 optimized geometries. Labels are associated with the QM9 molecule labelling system given by Ramakrishnan et al.
Please note that only molecules containing H, C, N, O were considered. This exclusion was due to the limited number of molecules containing fluorine in the QM9 dataset, which was not sufficient to build a good description of the chemical environment for fluorine atoms. Including these molecules may have reduced the overall precision of any models trained on our data.
Load the dataset
Use the following Python script to load the dataset dictionary:

from datasets import load_from_disk
dataset = load_from_disk(root_directory)
print(dataset)

Expected output:

DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
DFT Methods

All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was considered converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used in the integration. Structures were optimized in vacuum and in three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model. The Hessian matrices, vibrational frequencies, and normal modes were computed for a subset of 41,645 molecular geometries using the finite differences method.

Example model weights

An example model trained on the Hessian data is included in this dataset. Full details of the model will be provided in an upcoming publication. The model is an E(3)-equivariant graph neural network implemented with the e3x package. To load the model weights, use:

```python
import jax.numpy as jnp  # import added; the original snippet assumes JAX is available

params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
```
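The loaded object is expected to be a nested dictionary of parameter arrays (a JAX-style pytree); its exact layout depends on the e3x model architecture, which is not documented here. The loop below is only a generic inspection sketch under that assumption.

```python
# Generic inspection of the parameter tree; key names depend on the model.
for name, value in params.items():
    print(name, type(value))
```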
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information, please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open("<sample>.zarr", mode='r', path="volumes/raw")           # replace "<sample>.zarr" with the path to a downloaded sample
seg = zarr.open("<sample>.zarr", mode='r', path="volumes/gt_instances")
# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
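Because each neuron occupies its own channel of the ground-truth array (CZYX layout, as described above), the channels can be iterated to get per-neuron statistics. This is only a sketch and assumes `seg` was opened as shown above.

```python
import numpy as np

# Iterate over the per-neuron channels of the ground-truth array (CZYX layout).
for idx in range(seg.shape[0]):
    mask = np.asarray(seg[idx]) > 0          # binary mask of neuron `idx`
    print(f"neuron {idx}: {int(mask.sum())} foreground voxels")
```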
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
viewer.add_labels(
gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path/to/sample.zarr>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description

This item contains all the data, code and analysis objects used in the paper: "Real-time single-molecule 3D tracking in E. coli based on cross-entropy minimization"
Elias Amselem*, Bo Broadwater*, Tora Hävermark, Magnus Johansson & Johan Elf. Dept. of Cell and Molecular Biology, Uppsala University, Sweden. *Equal contribution
We present a 3D tracking principle that approaches the sub-ms regime. The method is based on the true excitation point spread function and cross-entropy minimization for position localization of moving fluorescent reporters. Our implementation also features a new method for microsecond 3D point spread function positioning and a new estimator for diffusion analysis of tracking data. We successfully applied these methods to track the Trigger Factor protein in living bacterial cells.
Experimental data description
The data provided in this repository was generated by the microscope described in the publication mentioned above. The underlying real-time tracking principle and methods are outlined and evaluated using the code base and data found in this repository. This includes: the trajectory reconstruction based on cross-entropy minimization, the extended covariance estimator (ECVE) for diffusion, simulations for evaluating both the trajectory reconstruction and the ECVE method, and the Trigger Factor live-cell E. coli data with analysis.
Each entry includes the raw data with analysis code. In each entry, the folder "TriggerFactor_Code\ProjectMain" contains the main analysis Python file, which includes instructions on how to run the script. The folder also contains preconfigured main files to generate the data used in the manuscript; these data are also available in the ScourceData.zip file.
To understand how to use the code and its structure, please see the entry 20220722_EXP-22-BL9428_Example_Analysis and the included README.txt file. There you will find the Python requirements (dependencies and versions) and instructions on how to run simulations.