Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Biometrics is the process of measuring and analyzing human characteristics to verify a given person's identity. Most real-world applications rely on unique human traits such as fingerprints or the iris. Among these characteristics, the Electroencephalogram (EEG) stands out given its high inter-subject variability. Recent advances in Deep Learning and a deeper understanding of EEG processing methods have led to the development of models that accurately discriminate unique individuals. However, it is still uncertain how much EEG data is required to train such models. This work aims to determine the minimal amount of training data per subject required to develop a robust EEG-based biometric model (+95% and +99% testing accuracy) in a task-dependent setting. This goal is achieved by performing and analyzing 11,780 combinations of training sizes, employing various neural network-based learning techniques of increasing complexity, and using different feature extraction methods on the affective EEG-based DEAP dataset. Findings suggest that if Power Spectral Density or Wavelet Energy features are extracted from the artifact-free EEG signal, 1 and 3 s of data per subject are enough to achieve +95% and +99% accuracy, respectively. These findings contribute to the body of knowledge by paving the way for applying EEG to real-world, ecological biometric applications and by demonstrating methods to determine the minimal amount of data required for such applications.
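As an illustration only (not the authors' pipeline), Power Spectral Density band-power features could be extracted from a 1 s EEG window with SciPy as follows; the channel count, sampling rate, and band edges are assumptions.

```python
import numpy as np
from scipy.signal import welch

fs = 128                       # the preprocessed DEAP EEG is commonly sampled at 128 Hz (assumed here)
eeg = np.random.randn(32, fs)  # stand-in for 1 s of artifact-free EEG across 32 channels

freqs, psd = welch(eeg, fs=fs, nperseg=fs)   # PSD estimate per channel

bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}
features = np.array([
    psd[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)   # mean band power per channel
    for lo, hi in bands.values()
]).T.ravel()                                            # one feature vector for this subject window
```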
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.
Methods
Creating a training and validation image set
We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered “research-grade” if the observation is verifiable (includes an image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images at approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).
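For illustration only (not the authors' code), the temporal subsampling described above could be sketched in pandas as follows; the file name, the "year" column, and the sampling fraction for 2017-2019 are assumptions.

```python
import pandas as pd

obs = pd.read_csv("inaturalist_observations.csv")   # hypothetical export with a "year" column

early = obs[obs["year"].between(1995, 2016)]         # keep every pre-2017 image
cap = (obs["year"] == 2019).sum()                    # cap 2020 at roughly the 2019 count

recent = pd.concat([
    obs[obs["year"].between(2017, 2019)].sample(frac=0.5, random_state=0),  # illustrative fraction
    obs[obs["year"] == 2020].sample(n=cap, random_state=0),
])
training_pool = pd.concat([early, recent], ignore_index=True)
```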
One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there was one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).
For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants with clearly visible reproductive parts and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of the images from the four-stage dataset, leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).
Training a two-stage and four-stage CNN
We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained on a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the PyTorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it had fewer convolutional layers and thus was less computationally intensive than pre-trained neural networks with more layers. In early testing we reached the desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 training images (Deng et al. 2009). We used default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross-entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.
Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with a fully-connected layer with an output size of 2 for the two-class problem and 4 for the four-class problem. We resized and cropped the images to fit ResNet's input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss value that quantifies the model's inaccuracy. The gradient of the loss with respect to the model parameters is computed, and the model parameters are updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so that the validation data remain ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically on the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
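As a rough sketch of this setup (not the authors' exact code), the pre-trained ResNet-18 with a replaced final layer, the 224x224 preprocessing, and the stated defaults might look like this in PyTorch/torchvision; the normalization statistics shown are the standard ImageNet values and are an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 4  # 2 for the two-stage scheme, 4 for the four-stage scheme

# Resize/crop to ResNet's 224x224 input and normalize the RGB channels.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

# Load ResNet-18 pre-trained on ImageNet and replace its final fully-connected
# layer so the output size matches the number of phenophase classes.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()                          # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # SGD, learning rate 0.001
```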
We evaluated model accuracy and created confusion matrices using the model's predictions on the labeled validation data. This allowed us to evaluate the model's overall accuracy and identify which specific categories were the most difficult for the model to distinguish. To use the model to make phenology predictions on the full 40,761-image dataset, we created a custom dataloader in PyTorch using a custom Dataset class, which loads images listed in a CSV file and passes them through the model while keeping each prediction associated with its unique image ID.
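A minimal sketch of such a CSV-driven Dataset is shown below; it is illustrative only, and the column names ("image_id", "path") are assumptions about the file layout.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class CSVImageDataset(Dataset):
    """Loads images listed in a CSV and returns each image with its ID."""

    def __init__(self, csv_path, transform=None):
        self.records = pd.read_csv(csv_path)   # assumed columns: "image_id", "path"
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records.iloc[idx]
        image = Image.open(row["path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["image_id"]

# loader = DataLoader(CSVImageDataset("observations.csv", transform=preprocess), batch_size=4)
```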
Hardware information
Model training was conducted using a personal laptop (Ryzen 5 3500U CPU and 8 GB of memory) and a desktop computer (Ryzen 5 3600 CPU, NVIDIA RTX 3070 GPU, and 16 GB of memory).
Comparing CNN accuracy to human annotation accuracy
We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761-image dataset. An expert annotator (R. Reeb, who has over a year's experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using Student's t-tests.
According to our latest research, the global silicon photonic optical neural network chip market size reached USD 1.05 billion in 2024, exhibiting robust expansion driven by the surging demand for high-speed and energy-efficient computing solutions. The market is projected to grow at a remarkable CAGR of 25.8% from 2025 to 2033, reaching a forecasted value of USD 8.5 billion by 2033. This impressive growth trajectory is primarily attributed to the increasing adoption of artificial intelligence (AI) and high-performance computing (HPC) applications, which require advanced data processing capabilities and ultra-fast communication networks. As per the latest research, the convergence of silicon photonics and optical neural networks is revolutionizing the semiconductor industry, enabling next-generation computational architectures that promise unparalleled speed, scalability, and energy efficiency.
One of the primary growth factors fueling the silicon photonic optical neural network chip market is the exponential rise in data generation and the corresponding need for accelerated data processing. The proliferation of AI-driven applications, such as deep learning, computer vision, and natural language processing, demands computational platforms that can process vast volumes of data with minimal latency and reduced power consumption. Silicon photonics technology, by leveraging light for data transmission and computation, offers significant advantages over traditional electronic approaches, including higher bandwidth, lower signal loss, and improved thermal management. This has made silicon photonic optical neural network chips an attractive solution for data centers, cloud computing providers, and enterprises seeking to optimize their AI and HPC workloads.
Another critical driver for market growth is the ongoing technological advancements in photonic integration and chip manufacturing processes. Leading semiconductor manufacturers and research institutions are investing heavily in the development of monolithic and hybrid integration techniques, enabling the seamless incorporation of optical components such as transceivers, modulators, detectors, and waveguides onto a single silicon substrate. These innovations have resulted in compact, scalable, and cost-effective silicon photonic chips that can be mass-produced with high yield and reliability. The integration of photonic and electronic elements on the same chip not only enhances performance but also reduces the overall system footprint, making these chips ideal for deployment in space-constrained environments such as edge devices and mobile platforms.
Furthermore, the silicon photonic optical neural network chip market is witnessing significant traction from the healthcare and automotive sectors, where real-time data processing and low-latency communication are critical. In healthcare, these chips are being leveraged for advanced medical imaging, genomics, and diagnostics, enabling faster and more accurate analysis of complex datasets. In the automotive industry, the growing adoption of autonomous vehicles and advanced driver-assistance systems (ADAS) is driving the need for high-speed, low-power computing solutions capable of processing sensor data in real time. The versatility and performance benefits of silicon photonic chips are thus opening new avenues for innovation across a diverse range of applications, further propelling market growth.
From a regional perspective, North America currently dominates the global market, accounting for the largest revenue share in 2024, followed closely by Asia Pacific and Europe. The presence of leading technology companies, well-established research and development infrastructure, and robust investment in AI and HPC initiatives have positioned North America as a frontrunner in the adoption of silicon photonic optical neural network chips. Asia Pacific, on the other hand, is emerging as a high-growth region, driven by rapid industrialization, increasing data center deployments, and government initiatives to promote advanced semiconductor technologies. Europe is also witnessing steady growth, supported by strong collaborations between academia and industry, as well as a growing focus on digital transformation across key sectors. The Middle East & Africa and Latin America, while currently representing smaller market shares, are expected to experience accelerated growth over the forecast period, fueled by rising investments in digital infrastructure and smart technologies.
This data release contains the model inputs, outputs, and source code (written in R) for the boosted regression tree (BRT) and artificial neural network (ANN) models developed for four sites in Upper Klamath Lake, which were used to simulate daily maximum pH and daily minimum dissolved oxygen (DO) from May 18th to October 4th in 2005-12 and 2015-19, and to evaluate variable effects and their importance. Simulations were not developed for 2013 and 2014 due to a large amount of missing meteorological data. The sites included: 1) Williamson River (WMR), which was located in the northern portion of the lake near the mouth of the Williamson River and had a depth between 0.7 and 2.9 meters; 2) Rattlesnake Point (RPT), which was located near the southern portion of the lake and had a depth between 1.9 and 3.4 meters; 3) Mid-North (MDN), which was located in the northwest portion of the lake and had a depth between 2.4 and 4.2 meters; 4) Mid-Trench (MDT), which was located in the trench that runs along the western portion of the lake and had a depth between 13.2 and 15 meters.
According to our latest research, the global Optical Neural Network Chip market size reached USD 1.2 billion in 2024, driven by rapid advancements in artificial intelligence hardware and increasing demand for high-speed data processing. The market is exhibiting a robust CAGR of 34.7% and is expected to reach a forecasted value of USD 16.5 billion by 2033. This extraordinary growth is primarily attributed to the surge in AI workloads, the proliferation of edge computing, and the ongoing shift toward photonic-based data transmission and computation, which offer significant speed and energy efficiency advantages over traditional electronic counterparts.
The primary growth factor fueling the Optical Neural Network Chip market is the exponential increase in data generation and the corresponding need for faster, more efficient data processing. As industries like healthcare, finance, and telecommunications increasingly rely on AI-driven analytics, the limitations of conventional electronic chips have become apparent. Optical neural network chips, leveraging the principles of photonics, enable parallel data processing at the speed of light, drastically reducing latency and power consumption. This makes them especially suitable for applications such as real-time image recognition, natural language processing, and large-scale machine learning tasks. The ability to handle massive datasets with minimal heat generation and energy loss is a compelling advantage that is accelerating the adoption of optical neural network chips across various sectors.
Another significant driver is the evolution of data center architectures. With the rise of cloud computing and the Internet of Things (IoT), data centers are under immense pressure to deliver higher performance while minimizing energy costs. Optical neural network chips are emerging as a transformative solution, enabling data centers to achieve unprecedented processing speeds and scalability. The integration of photonic components, such as silicon photonics and photonic integrated circuits, allows for seamless interconnectivity and efficient data transmission within and between servers. This not only enhances computational throughput but also supports the growing trend of distributed AI processing, paving the way for more sophisticated and responsive AI services.
Furthermore, the push toward edge computing is reshaping the landscape of AI hardware deployment. As devices at the edge—ranging from autonomous vehicles to smart medical devices—require real-time decision-making capabilities, the limitations of traditional chips become a bottleneck. Optical neural network chips, with their compact form factors and ultra-low latency, are ideally positioned to meet these demands. Their integration into edge devices enables faster inference, reduced dependence on centralized cloud resources, and improved user experiences. The convergence of optical technologies with AI at the edge is expected to unlock new business models and applications, further propelling market growth.
Regionally, North America currently dominates the Optical Neural Network Chip market, accounting for the largest share of global revenue in 2024. This leadership is underpinned by significant investments in AI research, a robust ecosystem of semiconductor and photonics companies, and strong demand from sectors such as IT & telecommunications and healthcare. Meanwhile, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, government initiatives supporting AI and photonics research, and the expansion of data center infrastructure in countries like China, Japan, and South Korea. Europe is also making notable strides, particularly in automotive and industrial automation applications, supported by a strong focus on innovation and sustainability. As the market matures, collaboration across regions and industries will be crucial in shaping the future trajectory of optical neural network chip adoption.
In the realm of AI hardware, the development of the Photonic Neural Network Accelerator Card is a significant milestone. This technology is designed to harness the power of light for data processing, offering unprecedented speed and efficiency. By integrating photonic components, these accelerator cards can handle
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into optimal training set sizes, including assessing whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens being required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and shape measurements for the internal structure poses a greater challenge compared to the external structure, due to low contrast differences between different materials and increased geometric complexity. These results provide novel insight into optimal training set sizes for precise image segmentation of diverse traits and highlight the potential of data augmentation for enhancing multivariate feature extraction from 3D images.

Methods

Data collection

50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low-trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13]. The samples containing these individuals and species spanned 5.65 million years ago (Ma) to 2.85 Ma [14] and were collected from the Ceara Rise in the Equatorial Atlantic region at Ocean Drilling Program (ODP) Site 925, which comprised Hole 925B (4°12.248'N, 43°29.349'W), Hole 925C (4°12.256'N, 43°29.349'W), and Hole 925D (4°12.260'N, 43°29.363'W). See Curry et al. [15] for more details. This group was chosen to provide inter- and intraspecific species variation, and to provide contemporary data to test how morphological distinctiveness maps to taxonomic hypotheses [16]. The non-destructive imaging of both internal and external structures of the foraminifera was conducted at the µ-VIS X-ray Imaging Centre, University of Southampton, UK, using a Zeiss Xradia 510 Versa X-ray tomography scanner. Employing a rotational target system, the scanner operated at a voltage of 110 kV and a power of 10 W.
Projections were reconstructed using Zeiss Xradia software, resulting in 16-bit greyscale .tiff stacks characterised by a voxel size of 1.75 μm and an average dimension of 992 x 1015 pixels for each 2D slice.

Generation of training sets

We extracted the external calcite and internal cavity spaces from the micro-CT scans of the 50 individuals using manual segmentation within Dragonfly v. 2021.3 (Object Research Systems, Canada). This step took approximately 480 minutes per specimen (24,000 minutes total) and involved the manual labelling of 11,947 2D images. Segmentation data for each specimen were exported as multi-label (3 labels: external, internal, and background) 8-bit multipage .tiff stacks and paired with the original CT image data to allow for training (see figure 2). The 50 specimens were categorised into three distinct groups (electronic supplementary material, table S1): 20 training image stacks, 10 validation image stacks, and 20 test image stacks. From the training image category, we generated six distinct training sets, varying in size from 1 to 20 specimens (see table 1). These were used to assess the impact of training set size on segmentation accuracy, as determined through a comparative analysis against the validation set (see Section 2.3). From the initial six training sets, we created six additional training sets through data augmentation using the NumPy library [17] in Python. This augmentation method was chosen for its simplicity and accessibility to researchers with limited computational expertise, as it can be easily implemented using a straightforward batch code. The augmentation process entailed rotating the original images five times (the maximum amount permitted using this method), effectively producing six distinct 3D orientations per specimen for each of the original training sets (see figure 3). The augmented training sets comprised between 6 and 120 .tiff stacks (see table 1).

Training the neural networks

CNNs were trained using the offline version of Biomedisa, which utilises a 3D U-Net architecture [18] – the primary model employed for image segmentation [19] – and is optimised using Keras with a TensorFlow backend. We used patches of size 64 x 64 x 64 voxels, which were then scaled to a size of 256 x 256 x 256 voxels. This scaling was performed to improve the network's ability to capture spatial features and mitigate potential information loss during training. We trained 3 networks for each of the training sets to check the extent of stochastic variation in the results [20]. To train our models in Biomedisa, we used stochastic gradient descent with a learning rate of 0.01, a decay of 1 × 10⁻⁶, a momentum of 0.9, and Nesterov momentum enabled. A stride size of 32 pixels and a batch size of 24 samples per epoch were used alongside an automated cropping feature, which has been demonstrated to enhance accuracy [21]. The training of each network was performed on a Tesla V100S-PCIE-32GB graphics card with 30,989 MB of available memory. All the analyses and training procedures were conducted on the High-Performance Computing (HPC) system at the Natural History Museum, London. To measure network accuracy, we used the Dice similarity coefficient (Dice score), a metric commonly used in biomedical image segmentation studies [22, 23]. The Dice score quantifies the level of overlap between two segmentations, providing a value between 0 (no overlap) and 1 (perfect match).
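For reference, a minimal sketch of the Dice score computation on a pair of binary masks (not Biomedisa's internal implementation) could look like this:

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks (0 = no overlap, 1 = perfect match)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, truth).sum() / denom
```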
We conducted experiments to evaluate the potential efficiency gains of using an early stopping mechanism within Biomedisa. After testing a variety of epoch limits, we opted for an early stopping criterion set at 25 epochs, which was the lowest value at which all models trained correctly for every training set. By “trained correctly” we mean that if there is no increase in Dice score within a 25-epoch window, the optimal network is selected and training is terminated. To gauge the impact of early stopping on network accuracy, we compared the results obtained from the original six training sets under early stopping to those obtained on a full run of 200 epochs.

Evaluation of feature extraction

We used the median-accuracy network from each of the 12 training sets to produce segmentation data for the external and internal structures of the 20 test specimens. The median accuracy was selected as it provides a more robust estimate of performance by ensuring that outliers had less impact on the overall result. We then compared the volumetric and shape measurements from the manual data to those from each training set. The volumetric measurements were total volume (comprising both external and internal volumes) and percentage calcite (calculated as the ratio of external volume to internal volume, multiplied by 100). To compare shape, mesh data for the external and internal structures were generated from the segmentation data of the 12 training sets and the manual data. Meshes were decimated to 50,000 faces and smoothed before being scaled and aligned using Python and Generalized Procrustes Surface Analysis (GPSA) [24], respectively. Shape was then analysed using the landmark-free morphometry pipeline outlined by Toussaint et al. [25]. We used a kernel width of 0.1 mm and a noise parameter of 1.0 for the analysis of shape for both the external and internal data, using a Keops kernel (PyKeops; https://pypi.org/project/pykeops/) as it performs better with large data [25]. The analyses were run for 150 iterations, using an initial step size of 0.01. The manually generated mesh for the individual st049_bl1_fo2 was used as the atlas for both the external and internal shape comparisons.
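As a worked illustration of the volumetric measurements described above (the label coding and unit conversion are assumptions, not the authors' code), total volume and percentage calcite could be computed from a labelled stack as follows:

```python
import numpy as np

def volumetric_measures(labels, voxel_size_mm=1.75e-3):
    """labels: 3D array with 0 = background, 1 = external calcite, 2 = internal cavity (assumed coding);
    voxel_size_mm: voxel edge length in mm (1.75 um per the scan description)."""
    voxel_volume = voxel_size_mm ** 3
    external = (labels == 1).sum() * voxel_volume
    internal = (labels == 2).sum() * voxel_volume
    total = external + internal                          # total volume (external + internal)
    percent_calcite = 100.0 * external / internal        # ratio of external to internal volume x 100
    return total, percent_calcite
```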
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Prismatic slip in magnesium at temperatures T ≲ 150 K occurs at ∼ 100 MPa independent of temperature, and jerky flow due to large prismatic dislocation glide distances is observed; this athermal regime is not understood. In contrast, the behavior at T ≳ 150 K is understood to be governed by a thermally-activated double-cross-slip of the stable basal screw dislocation through an unstable or weakly metastable prism screw configuration and back to the basal screw. Here, a range of neural network potentials (NNPs) that are very similar for many properties of Mg, including the basal-prism-basal cross-slip path and process, are shown to have an instability in prism slip at a potential-dependent critical stress. One NNP, NNP-77, has a critical instability stress in good agreement with experiments and also has basal-prism-basal transition path energies in very good agreement with DFT results, making it an excellent potential for understanding Mg prism slip. Full 3d simulations of the expansion of a prismatic loop using NNP-77 then also show a transition from cross-slip onto the basal plane at low stresses to prismatic loop expansion with no cross-slip at higher stresses, consistent with in-situ TEM observations. These results reveal (i) the origin and prediction of the observed unstable low-T prismatic slip in Mg and (ii) the critical use of machine-learning potentials to guide discovery and understanding of new important metallurgical behavior.
I hope that by uploading some of the data previously shared by [SETI][1] onto Kaggle, more people will become aware of SETI's work and become engaged in the application of machine learning to the data (amongst other things). Note, I am in no way affiliated with SETI; I just think this is interesting data and amazing science.
If you’re reading this, then I’m guessing you have an interest in data science. And if you have an interest in data science, you’ve probably got an interest in science in general.
Out of every scientific endeavour undertaken by humanity, from mapping the human genome to landing a man on the moon, it seems to me that the Search for Extra-terrestrial Intelligence (SETI) has the greatest chance to fundamentally change how we think about our place in the Universe.
Just imagine if a signal was detected. Not natural. Not human. On the one hand it would be a Copernican-like demotion of mankind’s central place in the Cosmos, and on the other an awe-inspiring revelation that somewhere out there, at least once, extra-terrestrial intelligence emerged.
Over the past few years, SETI have launched a few initiatives to engage the public and ‘citizen scientists’ to help with their search. Below is a summary of their work to date (from what I can tell).
In January 2016, the Berkeley SETI Research Center at the University of California, Berkeley started a program called Breakthrough Listen, described as “*the most comprehensive search for alien communications to date*”. Radio data is currently being collected by the Green Bank Observatory in West Virginia and the Parkes Observatory in New South Wales, with optical data being collected by the Automated Planet Finder in California. Note that (for now at least), the rest of this description focuses on the radio data.
The basic technique for finding a signal is this: point the telescope at a candidate object and listen for 5 minutes. If any sort of signal is detected, point slightly away and listen again. If the signal drops away, then it's probably not terrestrial. Go back to the candidate and listen again. Is the signal still there? Now point to a second, slightly different position. How about now? The most interesting finding is, as you might expect, SIGNAL - NO SIGNAL - SIGNAL - NO SIGNAL - SIGNAL.
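To make the cadence concrete, here is a toy sketch (my own illustration, not SETI's pipeline) of an ON/OFF check; the SNR threshold and input format are assumptions.

```python
def interesting_cadence(snrs, threshold=10.0):
    """snrs: signal-to-noise ratios for [ON, OFF, ON, OFF, ON, OFF] pointings."""
    on, off = snrs[0::2], snrs[1::2]
    # A candidate is interesting if the signal appears in every ON pointing
    # and disappears in every OFF pointing.
    return all(s >= threshold for s in on) and all(s < threshold for s in off)

print(interesting_cadence([25.1, 0.8, 19.7, 1.2, 22.3, 0.5]))  # True: SIGNAL - NO SIGNAL - ...
```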
The Breakthrough Listen project has just about everything covered: the hardware and software to collect signals, the time, the money, and the experts to run the project. The only sticking point is the data. Even after compromising on the raw data's time or frequency resolution, Breakthrough Listen is archiving 500 GB of data every hour (!).
The resulting data are stored in filterbank files, which are created at three different frequency resolutions. These are:
To engage the public, Breakthrough Listen's primary method is something called SETI@Home, where a program can be downloaded and installed, and your PC is used when idle to download packets of data and run various analysis routines on them.
Beyond this, they have shared a number of starter scripts and some data. To find out more, a general landing page can be found [here][2]. The scripts can be found on GitHub [here][3], and a data archive can be found [here][4]. Note that the optical data from the Automated Planet Finder is also in a different format called a FITS file.
The second initiative by SETI to engage the public was the SETI@IBMCloud project launched in September 2016. This provided the public with access to an enormous amount of data via the IBM Cloud platform. This initiative, too, came with an excellent collection of starter scripts which can still be found on GitHub [here][5]. Unfortunately, at the time of writing, this project is on hold and the data cannot be accessed.
There are a few other sources of data online from SETI, one of which is the basis for this dataset.
In the summer of 2017, SETI hosted a machine learning challenge where simulated datasets of various sizes were provided to participants along with a blinded test set. The winning team achieved a classification accuracy of 94.67% using a convolutional neural network. The aim of this challenge was to attempt a novel approach to signal detection, namely to go beyond traditional signal analysis approaches and turn the problem into an image classification task after converting the signals into spectrograms.
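For intuition, a minimal sketch of the signal-to-spectrogram conversion (on a synthetic drifting tone, not the challenge data) might look like this; all parameters are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 1024.0                                  # assumed sample rate
t = np.arange(0, 8, 1 / fs)
signal = np.sin(2 * np.pi * (100 + 5 * t) * t) + 0.5 * np.random.randn(t.size)  # drifting tone + noise

freqs, times, sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
image = 10 * np.log10(sxx + 1e-12)           # log-power "waterfall" image, ready for a CNN classifier
```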
The primary traini...
According to our latest research, the Binary Neural Network SRAM market size reached USD 1.28 billion in 2024, demonstrating robust momentum driven by the surging adoption of edge AI and energy-efficient memory solutions. The market is expected to grow at a CAGR of 21.7% from 2025 to 2033, propelling the market value to approximately USD 8.93 billion by 2033. This remarkable growth is primarily fueled by escalating demand for high-performance, low-power SRAM in AI accelerators, IoT devices, and next-generation data centers, as organizations worldwide intensify their investments in edge computing and artificial intelligence infrastructure.
The rapid proliferation of artificial intelligence and machine learning applications across diverse industries is a significant growth driver for the Binary Neural Network SRAM market. As AI models become more complex, there is a critical need for memory architectures that can deliver high-speed data access while minimizing power consumption. Binary neural networks, which utilize quantized weights and activations, enable substantial reductions in memory footprint and computational requirements. SRAM, with its inherent speed and low latency, is increasingly being integrated into AI accelerators and edge devices to support real-time inference and on-device intelligence. This trend is especially pronounced in sectors such as consumer electronics, automotive, and healthcare, where energy efficiency and rapid decision-making are paramount.
Another key factor contributing to the expansion of the Binary Neural Network SRAM market is the evolution of edge computing and the Internet of Things (IoT). As more devices become interconnected and capable of processing data locally, there is a growing emphasis on deploying AI models at the edge, closer to the source of data generation. This shift necessitates memory solutions that offer high throughput, low latency, and minimal power draw, making SRAM an ideal choice for binary neural network implementations. The integration of SRAM in edge AI chips is enabling new use cases in smart homes, industrial automation, and autonomous vehicles, further accelerating market growth.
Technological advancements in SRAM architectures, such as the development of 6T, 8T, and 10T SRAM cells, are also playing a pivotal role in shaping the Binary Neural Network SRAM market. These innovations are enhancing the density, reliability, and scalability of SRAM, allowing for more efficient deployment of binary neural networks in increasingly compact and power-constrained environments. The continuous miniaturization of semiconductor nodes and the adoption of advanced fabrication techniques are expected to unlock new opportunities for market participants, as they strive to meet the evolving demands of AI-driven applications.
From a regional perspective, Asia Pacific is emerging as the dominant force in the Binary Neural Network SRAM market, driven by the presence of leading semiconductor manufacturers, robust investments in AI research, and the rapid expansion of consumer electronics and automotive industries. North America and Europe are also witnessing substantial growth, fueled by advancements in AI hardware, strong R&D ecosystems, and increasing adoption of edge computing solutions. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by government initiatives and growing digitization efforts. The global landscape is characterized by intense competition and a relentless pursuit of innovation, as companies seek to capitalize on the burgeoning demand for AI-optimized memory solutions.
The Binary Neural Network SRAM market is segmented by product type into Embedded SRAM and Standalone SRAM, each catering to distinct application requirements and industry needs. Embedded SRAM, integrated directly into system-on-chip (SoC) architectures, has gained significant traction due to its ability to pr
Fostered by technological and theoretical developments, deep neural networks (DNNs) have achieved great success in many applications, but their training via mini-batch stochastic gradient descent (SGD) can be very costly due to the possibly tens of millions of parameters to be optimized and the large amounts of training examples that must be processed. The computational cost is exacerbated by the inefficiency of the uniform sampling typically used by SGD to form the training mini-batches: since not all training examples are equally relevant for training, sampling these under a uniform distribution is far from optimal, making the case for the study of improved methods to train DNNs. A better strategy is to sample the training instances under a distribution where the probability of being selected is proportional to the relevance of each individual instance; one way to achieve this is through importance sampling (IS), which minimizes the gradients' variance w.r.t. the network parameters, consequently improving convergence. In this paper, an IS-based adaptive sampling method to improve the training of DNNs is introduced. This method exploits side information to construct the optimal sampling distribution and is dubbed regularized adaptive sampling (RAS). Experimental comparison using deep convolutional networks for classification of the MNIST and CIFAR-10 datasets shows that when compared against SGD and against another sampling method in the state of the art, RAS produces improvements in the speed and variance of the training process without incurring significant overhead or affecting the classification.
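As a rough illustration of the general importance-sampling idea described above (not the RAS algorithm itself, which additionally uses side information and regularization), one could bias PyTorch mini-batch selection toward high-loss examples like this:

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

def importance_sampling_loader(dataset, per_example_loss, batch_size=128):
    # Selection probability proportional to each example's current "relevance",
    # here approximated by its most recently observed loss.
    weights = per_example_loss.clamp(min=1e-8)
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# Note: for an unbiased gradient estimate, each sampled example's loss should be
# reweighted by the inverse of its selection probability.
```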
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Influence of the training set on the generalizability of a model. For each dataset, models were trained on 100 different training sets containing 4 images each and were validated on the remaining images. The difference between the maximum and minimum Dice scores obtained for each sample across all runs was calculated and averaged over all samples. The standard deviation of the differences is also shown to provide a reference regarding the degree of variation observed among samples.
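A small sketch of the statistic described above (array shapes are placeholders): for each sample, the spread of Dice scores across runs, then the mean and standard deviation of those spreads.

```python
import numpy as np

dice = np.random.rand(100, 30)                 # placeholder: 100 runs x N validation samples
spread = dice.max(axis=0) - dice.min(axis=0)   # per-sample max - min Dice across runs
print(spread.mean(), spread.std())             # average spread and its standard deviation
```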
This model archive contains the input data, model code, and model outputs for machine learning models that predict daily non-tidal stream salinity (specific conductance) for a network of 459 modeled stream segments across the Delaware River Basin (DRB) from 1984-09-30 to 2021-12-31. There are a total of twelve models from combinations of two machine learning models (Random Forest and Recurrent Graph Convolution Neural Networks), two training/testing partitions (spatial and temporal), and three input attribute sets (dynamic attributes, dynamic and static attributes, and dynamic attributes and a minimum set of static attributes). In addition to the inputs and outputs for non-tidal predictions provided on the landing page, we also provide example predictions for models trained with additional tidal stream segments within the model archive (TidalExample folder), but we do not recommend our models for this use case. Model outputs contained within the model archive include performance metrics, plots of spatial and temporal errors, and Shapley (SHAP) explainable artificial intelligence plots for the best models. The results of these models provide insights into DRB stream segments with elevated salinity, and processes that drive stream salinization across the DRB, which may be used to inform salinity management. This data compilation was funded by the USGS.
As digital media grows, competition between online platforms has also increased rapidly. Online platforms like BuzzFeed, Mashable, Medium, and Towards Data Science publish hundreds of articles every day. In this report, we analyze the Mashable dataset, which consists of article information such as the number of unique words, the number of non-stop words, the positive polarity of words, the negative polarity of words, etc. Here we intend to predict the number of times an article will be shared. This will be very helpful for Mashable in deciding which articles to publish, because they can predict which articles will receive the maximum number of shares. Random forest regression has been used to predict the number of shares, and it achieves an accuracy of 70% with parameter tuning. Articles are collected in many different ways, but classifying or grouping these articles into separate categories is a difficult job for an online platform. To handle this problem, we have used neural networks to classify the articles into different categories. By doing so, people do not need to do an extensive search, because Mashable can keep an interface with articles classified into different categories, which in turn helps people choose a category and directly find their articles.
With the growth of the Internet in daily life, people are only a minute away from reading the news, watching entertainment, or reading articles of different categories. As the Internet has grown, people's usage of it has increased rapidly, and it has become part of their lives. Nowadays, as people use the Internet more, they read articles online for knowledge, news, or any other sector. As demand has increased, rivalry among online platforms has increased as well. Because of this, every online platform strives to publish articles on their site that have great value and bring the most shares. In this project, we predict the shares of an article based on the data produced by ‘Mashable’, who collected data on around 39,000 articles. For this prediction, we have used random forest regression. In this report, we discuss why random forest regression was chosen for the prediction of shares, by analyzing the dataset, doing cross-tabulation, and examining the variance of the dataset and how much bias it holds. We also discuss feature selection, why we decided to do some feature engineering, and how it helps increase accuracy. Finally, we discuss how these predictions can help the Mashable organization in its decisions about publishing articles.
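For illustration only (not the authors' exact pipeline), a random forest regression baseline on the Mashable data could be set up with scikit-learn as follows; the file name, column names, and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("OnlineNewsPopularity.csv")        # assumed export of the Mashable dataset
X = data.drop(columns=["url", "shares"])              # assumed column names
y = data["shares"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                    # R^2 on held-out articles
```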
In this paper, we also address the issue of classifying articles into categories such as entertainment, news, lifestyle, technology, etc. To obtain this classification, we used neural networks. We discuss why we chose neural networks for classification, what type of feature engineering was used, how the model was affected by different numbers of hidden layers and neurons, and at what stage the model started to overfit. For classification, we used a softmax function at the output layer. An 11-layer neural network classifier has been used, achieving around 80% accuracy. The methods used to achieve this accuracy were repeatedly checking accuracy with different numbers of layers and neurons, a standardization technique, and feature selection using a correlation matrix.
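A hedged sketch of this kind of pipeline is shown below (illustrative only; the layer sizes, the number of categories, and the selection threshold are assumptions, not the exact 11-layer architecture used):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, models

def select_by_correlation(X, y, k=30):
    # Keep the k features most correlated (in absolute value) with the target.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

def build_classifier(n_features, n_classes=6):
    model = models.Sequential()
    model.add(layers.Input(shape=(n_features,)))
    for units in (256, 128, 128, 64, 64, 32, 32, 16, 16):     # hidden stack (assumed sizes)
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))  # softmax over article categories
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Example usage (X, y are the feature matrix and category labels):
# X_sel = X[:, select_by_correlation(X, y)]
# X_std = StandardScaler().fit_transform(X_sel)
# build_classifier(X_std.shape[1]).fit(X_std, y, epochs=30, validation_split=0.2)
```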
Related work on the study and analysis of online news popularity was done by Shuo Zhang from the Australian National University, who predicted whether an article would be popular or not using binary neural network classification. Other related work, by He Ren and Quan Yang in the Department of Electrical Engineering at Stanford University, also achieved an accuracy of around 70%, but there the shares were predicted by applying different regression techniques.
This work is about bringing value out of a large dataset and showing how that value can help organizations: analyzing large volumes of data, correlating features and calculating their predictive power with respect to the target variable, and selecting different machine learning algorithms and assessing their suitability. Neural networks work efficiently for high-dimensional datasets but need very high computational time.
• Predicting the number of shares an article can get
• Classifying the articles into different categories
• Which category of article should be published the most for a higher number of shares?
• On what weekday should Mashable post which type of article?
• For different categories of articles, what should their minimum and maximum content length be?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected to perform clustering on were not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we apply k-means to are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
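To make the discussed pipeline concrete, here is a hedged sketch (synthetic data; the number of clusters and the classifier are assumptions) of adding k-means cluster labels as a feature before classification and comparing against the raw features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # placeholder numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # placeholder labels

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])         # cluster number appended as a new feature

clf = RandomForestClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())      # baseline on raw features
print(cross_val_score(clf, X_aug, y, cv=5).mean())  # with the cluster-label feature
```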
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
From the DeepTriage abstract:
For a given software bug report, identifying an appropriate developer who could potentially fix the bug is the primary task of a bug triaging process. A bug title (summary) and a detailed description are present in most bug tracking systems. An automatic bug triaging algorithm can be formulated as a classification problem, which takes the bug title and description as the input and maps it to one of the available developers (class labels). The major challenge is that the bug description usually contains a combination of free unstructured text, code snippets, and stack traces, making the input data highly noisy. In the past decade, there has been a considerable amount of research on representing a bug report using a tf-idf based bag-of-words (BOW) feature model. However, the BOW model does not consider the syntactical and sequential word information available in the descriptive sentences.
In this research, we propose a novel bug report representation algorithm using an attention-based deep bidirectional recurrent neural network (DBRNN-A) model that learns syntactic and semantic features from long word sequences in an unsupervised manner. Instead of BOW features, the DBRNN-A based robust bug representation is then used for training the classification model. Further, using an attention mechanism enables the model to learn the context representation over a long word sequence, as in a bug report. To provide a large amount of data to train the feature learning model, the unfixed bug reports (which constitute about 70% of bugs in an open-source bug tracking system) are leveraged as an important contribution of this research; they were completely ignored in previous studies.
Another major contribution is to make this research reproducible by making the source code available and creating a public benchmark dataset of bug reports from three open-source bug tracking systems: Google Chromium, Mozilla Core, and Mozilla Firefox. For our experiments, we use 383,104 bug reports from Google Chromium, 314,388 bug reports from Mozilla Core, and 162,307 bug reports from Mozilla Firefox. Experimentally we compare our approach with the BOW model and softmax classifier, support vector machine, naive Bayes, and cosine distance, and observe that DBRNN-A provides a higher rank-10 average accuracy.
This dataset contains the bug data for Google Chromium with four different training sets and one test set.
classifier_data_0.csv is a version of training data with no minimum number of occurrences for any class (most unbalanced).
classifier_data_5.csv contains a version of the training data where every class occurs at least 5 times.
classifier_data_10.csv contains a version of the training data where every class occurs at least 10 times.
classifier_data_20.csv contains a version of the training data where every class occurs at least 20 times. (most balanced)
deep_data.csv contains the test data.
*In this data, the classes are the owners.
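For a quick look at these files, a hypothetical loading sketch with pandas is shown below; the column names ("description", "owner") are assumptions about the CSV layout.

```python
import pandas as pd

train = pd.read_csv("classifier_data_20.csv")   # most balanced training split
test = pd.read_csv("deep_data.csv")             # test data

print(train["owner"].nunique(), "developer classes")   # the classes are the bug owners
print(train["description"].str.len().describe())       # rough look at report lengths
```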
DeepTriage: Exploring the Effectiveness of Deep Learning for Bug Triaging. Senthil Mani, Anush Sankaran, Rahul Aralikatte, IBM Research, India.
The dataset, code and paper can be found at this webpage: http://bugtriage.mybluemix.net/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
These files contain datasets used in the article titled "Machine learning based prediction of dynamical clustering in granular gases". Each of these files contains data described as follows:
According to our latest research, the global spiking neural network chip market size reached USD 690 million in 2024, reflecting robust momentum driven by the proliferation of neuromorphic computing applications. The market is expected to expand at a remarkable CAGR of 23.1% from 2025 to 2033, reaching a forecasted value of USD 4.35 billion by 2033. This impressive growth trajectory is primarily attributed to the escalating demand for ultra-efficient, brain-inspired computing solutions in edge devices, autonomous systems, and next-generation artificial intelligence (AI) applications.
The exponential growth in the spiking neural network chip market is underpinned by the surging adoption of AI-driven technologies across various industries. As organizations seek to emulate the human brain's efficiency in processing information, spiking neural network (SNN) chips are gaining prominence for their ability to process event-driven data with ultra-low power consumption and high parallelism. This unique capability is especially critical in applications such as robotics, autonomous vehicles, and edge computing, where real-time decision-making and energy efficiency are paramount. The convergence of AI, IoT, and neuromorphic hardware is further catalyzing this market, as industries increasingly prioritize hardware acceleration for AI workloads at the edge.
Another significant growth factor is the rapid evolution of semiconductor technologies, which has enabled the development of more sophisticated and scalable SNN chips. Advances in complementary metal-oxide-semiconductor (CMOS), memristor, and field-programmable gate array (FPGA) technologies are facilitating the integration of spiking neural networks into a diverse range of devices and systems. This technological progress is empowering manufacturers to deliver chips with enhanced computational capabilities, reduced latency, and improved energy efficiency. Moreover, the growing availability of open-source neuromorphic frameworks and software tools is lowering the barriers to entry for developers, further fueling innovation and adoption in this market.
Strategic collaborations between technology giants, research institutions, and startups are also playing a pivotal role in driving market expansion. Leading companies are investing heavily in R&D to refine SNN chip architectures and explore new use cases, from industrial automation to healthcare diagnostics. The increasing focus on edge AI and the need for real-time, context-aware intelligence in smart devices are prompting end-users to transition from conventional deep learning accelerators to SNN-based solutions. These collaborative efforts are accelerating the commercialization of spiking neural network chips and fostering a dynamic ecosystem that supports long-term market growth.
The emergence of Neuromorphic Computing Chip technology is revolutionizing the landscape of spiking neural network chips. These chips are designed to mimic the human brain's neural architecture, allowing for more efficient data processing and energy consumption. This technology is particularly beneficial in applications requiring rapid, real-time data processing, such as autonomous vehicles and robotics. By leveraging the principles of neuromorphic computing, these chips can handle complex computations with minimal power usage, making them ideal for edge devices where energy efficiency is critical. As the demand for smarter, more efficient computing solutions grows, the integration of neuromorphic computing chips into various applications is expected to drive significant advancements in the field.
Regionally, Asia Pacific stands out as the fastest-growing market, fueled by large-scale investments in AI research, semiconductor manufacturing, and smart infrastructure projects. North America remains a major hub for innovation, driven by the presence of leading chip manufacturers and a strong emphasis on autonomous systems development. Meanwhile, Europe is witnessing significant traction in industrial automation and automotive applications, supported by robust government initiatives and research funding. The global competitive landscape is becoming increasingly dynamic, with companies vying for leadership through technological differentiation and strategic partnerships.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The full description can also be found in README.md.
The dataset was used in the paper “Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks”:
https://arxiv.org/abs/2504.21844
This paper presents a scalable heterogeneous graph neural network (HGNN) with integrated pruning layers, which jointly determines whether tracks originate from the decay of beauty hadrons and associates each track with a proton-proton collision point known as a primary vertex (PV).
For training HGNNs and GNNs on the dataset, see the associated GitHub repository:
https://github.com/willsutcliffe/scalable_mtl_hgnn
The events in this dataset are based on simulation generated with PYTHIA and EvtGen, in which the particle-collision conditions expected for LHC Run 3 are replicated, as shown in the table below.
| LHCb period | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons |
| --- | --- | --- | --- | --- |
| Runs 3-4 (Upgrade I) | ~5 | ~150 | < 1 | ~1 |
Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described in the paper in the appendix “Simulation”. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.
The datasets are divided into three categories:
Inclusive training and validation
The compressed file inclusive_training_validation_dataset.tar.gz contains the training dataset (40,000 events) and the validation dataset (10,000 events) of inclusive decays.
Inclusive test
The inclusive dataset inclusive_test_dataset.tar.gz contains the evaluation events (10,000).
Exclusive test and training
We provide samples of 5,000 events in which one b hadron is required to decay via a specific decay mode (an exclusive decay). For certain exclusive decay modes, we split the 5,000 events into a 1,000-event training set and a 4,000-event test set, used to train the HGNN (H2) in the paper.
Exclusive decays include:
Bd_DD_dataset.tar.gz
Bd_Kstmumu_dataset.tar.gz
Bd_Kpi_dataset.tar.gz
Bu_Kmumu_dataset.tar.gz
Bu_Kpipimumu_dataset.tar.gz
Bu_KKpi_dataset.tar.gz
Lb_Lcpi_dataset.tar.gz
Lb_pK_dataset.tar.gz
Lb_pKmumu_dataset.tar.gz
Bs_Dspi_dataset.tar.gz
Bs_Jpsiphi_dataset.tar.gz
The datasets we provide here overlap in some cases with our previous dataset for the Deep Full Event Interpretation at:
https://zenodo.org/records/7799170
which provides several of the datasets in .root format with more available information. Here, we provide the data in a format more amenable to training GNNs and HGNNs with PyTorch using our latest framework.
The relevant features used in the HGNN are described in the following. A Cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal, and the y axis being vertically oriented.
Events are stored in a graph format in files named input_<N>.npy, where each numbered input file represents a unique event. Meanwhile, the LCAG (Lowest Common Ancestor Generation) edge targets are contained within the corresponding target_<N>.npy files.
In the input files, the following graph data is stored in a dictionary format:
node features
These are contained under the key 'nodes' as a numpy array of shape (n_nodes, 13) and include, in index order:
edge features
These are contained under the key 'edges' as a numpy array of shape (n_edges, 4) and include, in index order, the following (an illustrative opening-angle computation follows the list):
Opening angle (θ): angle between the three-momentum directions of the two particles.
Momentum-transverse distance (d ⊥ P⃗): distance between the origin points of the two particles, defined in the plane transverse to the combined three-momentum of the two particles.
Distance along the beam axis (Δz): difference between the z-coordinate of the origin points of the two particles.
FromSamePV_MinIP: a reconstructed boolean variable indicating whether the two particles share the same associated primary vertex according to the minimum impact parameter.
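For illustration, the opening angle can be computed from the three-momenta of the two particles as shown in this short numpy sketch (the dataset already stores the feature pre-computed):

import numpy as np

def opening_angle(p1, p2):
    """Angle, in radians, between the three-momentum directions of two particles."""
    cos_theta = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))   # clip guards against rounding errors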
track edge relations
The keys 'senders' and 'receivers' yield the numpy arrays of sender and receiver node indices for tracks.
global features
targets
Meanwhile, in the target files the Lowest Common Ancestor Generation (LCAG) edge targets can be found in a one-hot encoded format under the key 'edges', in a numpy array of shape (n_edges, 4), with 4 referring to the four LCAG classes (0, 1, 2, 3).
Additional truth information for performance
For determining the reconstruction performance metrics in the paper, additional truth information is required, including LCAG mother particle identification numbers (MIDs) and particle identification numbers (IDs).
For the test datasets, we include this additional truth information. The graph features and LCAG edge targets for a single event can be loaded as follows:
import numpy as np

# load graph features and LCAG edge targets for event 0
graph_input_features = np.load("input_0.npy", allow_pickle=True).item()
graph_target = np.load("target_0.npy", allow_pickle=True).item()
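Continuing from the snippet above, the loaded dictionaries can, for example, be repackaged into a PyTorch Geometric Data object for experimentation. This is only an illustrative sketch based on the key names described earlier ('nodes', 'edges', 'senders', 'receivers'); the associated GitHub repository defines the loaders actually used in the paper.

import numpy as np
import torch
from torch_geometric.data import Data

# Continues from the loading snippet above (graph_input_features, graph_target).
x = torch.as_tensor(graph_input_features["nodes"], dtype=torch.float)           # (n_nodes, 13)
edge_attr = torch.as_tensor(graph_input_features["edges"], dtype=torch.float)   # (n_edges, 4)
edge_index = torch.as_tensor(
    np.stack([graph_input_features["senders"], graph_input_features["receivers"]]),
    dtype=torch.long,
)                                                                                # (2, n_edges)
y = torch.as_tensor(graph_target["edges"]).argmax(dim=1)                         # integer LCAG class per edge

event = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)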
Abstract. Objective: To determine the accuracy of computed tomography (CT) imaging assessed by deep neural networks for predicting the need for mechanical ventilation (MV) in patients hospitalized with severe acute respiratory syndrome due to coronavirus disease 2019 (COVID-19). Materials and Methods: This was a retrospective cohort study carried out at two hospitals in Brazil. We included CT scans from patients who were hospitalized due to severe acute respiratory syndrome and had COVID-19 confirmed by reverse transcription-polymerase chain reaction (RT-PCR). The training set consisted of chest CT examinations from 823 patients with COVID-19, of whom 93 required MV during hospitalization. We developed an artificial intelligence (AI) model based on convolutional neural networks. The performance of the AI model was evaluated by calculating its accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve. Results: For predicting the need for MV, the AI model had a sensitivity of 0.417 and a specificity of 0.860. The corresponding area under the ROC curve for the test set was 0.68. Conclusion: The high specificity of our AI model makes it possible to reliably predict which patients will and will not need invasive ventilation. That makes this approach ideal for identifying high-risk patients and predicting the minimum number of ventilators and critical care beds that will be required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the supplemental data for the manuscript titled "Characterization of mixing in nanoparticle hetero-aggregates using convolutional neural networks", submitted to Nano Select.
Motivation:
Detection of nanoparticles and classification of the material type in scanning transmission electron microscopy (STEM) images can be a tedious task if it has to be done manually. Therefore, a convolutional neural network is trained to do this task for STEM images of TiO2-WO3 nanoparticle hetero-aggregates. The present dataset contains the training data and some Jupyter notebooks that can be used, after installation of the MMDetection toolbox (https://github.com/open-mmlab/mmdetection), to train the CNN. Details are provided in the manuscript submitted to Nano Select and in the comments of the Jupyter notebooks.
Authors and funding:
The present dataset was created by the authors. The work was funded by the Deutsche Forschungsgemeinschaft within the priority program SPP2289 under contract numbers RO2057/17-1 and MA3333/25-1.
Dataset description:
Four Jupyter notebooks are provided, which can be used for different tasks according to their names. Details can be found within the comments and markdown cells. These notebooks can be run from within the mmdetection folder after MMDetection has been installed.
particle_detection_training.ipynb: This notebook can be used for network training.
particle_detection_evaluation.ipynb: This notebook is for evaluation of a trained network with simulated test images.
particle_detection_evaluation_experiment.ipynb: This notebook is for evaluation of a trained network with experimental test images.
particle_detection_measurement_experiment.ipynb: This notebook is for application of a trained network to experimental data.
In addition, a script titled particle_detection_functions.py is provided which contains functions required by the notebooks. Details can be found within the comments.
The zip archive training_data.zip contains the training data. The subfolder HAADF contains the images (sorted into training, validation, and test images), and the subfolder json contains the annotations (sorted the same way). Each file within the json folder provides the following information for the corresponding image (a short reading sketch follows the field list):
aggregat_no: image id, the number of the corresponding image file
particle_position_x: list of particle position x-coordinates in nm
particle_position_y: list of particle position y-coordinates in nm
particle_position_z: list of particle position z-coordinates in nm
particle_radius: list of volume equivalent particle radii in nm
particle_type: list of material types, 1: TiO2, 2: WO3
particle_shape: list of particle shapes: 0: sphere, 1: box, 2: icosahedron
rotation: list of particle rotations in rad. Each particle is rotated twice by the listed angle (before and after deformation)
deformation: list of particle deformations. After the first rotation, the x-coordinates of the particle's surface mesh are scaled by the factor listed in deformation, while the y- and z-coordinates are scaled by 1/sqrt(deformation), which preserves the particle volume.
cluster_index: list of cluster indices for each particle
initial_cluster_index: list of initial cluster indices for each particle, before primary clusters of the same material were merged
fractal_dimension: the intended fractal dimension of the aggregate
fractal_dimension_true: the realized geometric fractal dimension of the aggregate (neglecting particle densities)
fractal_dimension_weight_true: the realized fractal dimension of the aggregate (including particle densities)
fractal_prefactor: fractal prefactor
mixing_ratio_intended: the intended mixing ratio (fraction of WO3 particles)
mixing_ratio_true: the realised mixing ratio (fraction of WO3 particles)
mixing_ratio_volume: the realised mixing ratio (fraction of WO3 volume)
mixing_ratio_weight: the realised mixing ratio (fraction of WO3 weight)
particle_1_rho: density of TiO2 used for the calculations
particle_1_size_mean: mean TiO2 radius
particle_1_size_min: smallest TiO2 radius
particle_1_size_max: largest TiO2 radius
particle_1_size_std: standard deviation of TiO2 radii
particle_1_clustersize: average TiO2 cluster size
particle_1_clustersize_init: average TiO2 cluster size of primary clusters (before merging into larger clusters)
particle_1_clustersize_init_intended: intended TiO2 cluster size of primary clusters
particle_2_rho: density of WO3 used for the calculations
particle_2_size_mean: mean WO3 radius
particle_2_size_min: smallest WO3 radius
particle_2_size_max: largest WO3 radius
particle_2_size_std: standard deviation of WO3 radii
particle_2_clustersize: average WO3 cluster size
particle_2_clustersize_init: average WO3 cluster size of primary clusters (before merging into larger clusters)
particle_2_clustersize_init_intended: intended WO3 cluster size of primary clusters
number_of_primary_particles: number of particles within the aggregate
gyration_radius_geometric: gyration radius of the aggregate (neglecting particle densities)
gyration_radius_weighted: gyration radius of the aggregate (including particle densities)
mean_coordination: mean total coordination number (particle contacts)
mean_coordination_heterogen: mean heterogeneous coordination number (contacts with particles of a different material)
mean_coordination_homogen: mean homogeneous coordination number (contacts with particles of the same material)
radius_equiv: list of area equivalent particle radii (in projection)
k_proj: projection direction of the aggregate: 0: z-direction (axis = 2), 1: x-direction (axis = 1), 2: y-direction (axis = 0)
polygons: list of polygons that surround the particle (COCO annotation)
bboxes: list of particle bounding boxes
aggregate_size: projected area of the aggregate translated into the radius of a circle in nm
n_pix: number of pixels per image in the horizontal and vertical directions (square images)
pixel_size: pixel size in nm
image_size: image size in nm
add_poisson_noise: 1 if Poisson noise was added, 0 otherwise
frame_time: simulated frame time (required for Poisson noise)
dwell_time: dwell time per pixel (required for Poisson noise)
beam_current: beam current (required for Poisson noise)
electrons_per_pixel: number of electrons per pixel
dose: electron dose in electrons per Å2
add_scan_noise: 1 if scan noise was added, 0 otherwise
beam misposition: parameter that describes how far the beam can be misplaced in pm (required for scan noise)
scan_noise: parameter that describes how far the beam can be misplaced in pixel (required for scan noise)
add_focus_dependence: 1 if a focus effect is included, 0 otherwise
data_format: data format of the images, e.g. uint8
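To give a feel for the annotation layout, the following sketch reads one of the per-aggregate json files and inspects a few of the fields listed above. The file name is a hypothetical placeholder; only the keys follow the field list.

import json

# Placeholder file name; the actual per-aggregate files live in the json subfolder of training_data.zip.
with open("training_data/json/aggregate_0001.json") as fh:   # hypothetical name
    ann = json.load(fh)

n_particles = ann["number_of_primary_particles"]
materials = ann["particle_type"]        # 1: TiO2, 2: WO3
radii_nm = ann["particle_radius"]       # volume-equivalent radii in nm
boxes = ann["bboxes"]                   # particle bounding boxes for detection training
print(f"{n_particles} particles, WO3 fraction: {sum(m == 2 for m in materials) / n_particles:.2f}")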
There are 24,000 training images, 5,500 validation images, and 5,500 test images, together with their corresponding annotations. Aggregates and STEM images were obtained with the algorithm explained in the main work. The important data for CNN training is extracted from the files of the individual aggregates and consolidated in the subfolder COCO. For the training, validation, and test data, there is a file annotation_COCO.json that includes all information required for CNN training.
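Because the consolidated annotations follow the COCO format, they can also be browsed with the standard pycocotools API. The path below is assembled from the folder layout described above and may need adjusting.

from pycocotools.coco import COCO

coco = COCO("training_data/json/COCO/annotation_COCO.json")   # path assumed from the layout above
img_ids = coco.getImgIds()
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))       # annotations of the first image
print(f"{len(img_ids)} images, {len(anns)} annotations in the first image")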
The zip archive experiment_test_data.zip includes manually annotated experimental images. All experimental images were filtered as explained in the main work. The subfolder HAADF includes thirteen images. The subfolder json includes an annotation file for each image in COCO format. A single file consolidating all annotations is stored in json/COCO/annotation_COCO.json.
The zip archive experiment_measurement.zip includes the experimental images investigated in the manuscript. It contains four subfolders corresponding to the four investigated samples. All experimental images were filtered as explained in the manuscript.
The zip archive particle_detection.zip includes the network that was trained, evaluated, and used for the investigation in the manuscript. The network weights are stored in the file particle_detection/logs/fit/20230622-222721/iter_60000.pth. These weights can be loaded with the Jupyter notebook files. Furthermore, a configuration file, which is required by the notebooks, is stored as particle_detection/logs/fit/20230622-222721/config_file.py.
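For orientation, a checkpoint such as the one shipped here can be loaded for inference with MMDetection's high-level Python API. This is a hedged sketch using the file paths quoted above; the provided notebooks remain the authoritative entry point, and the test image path is a placeholder.

from mmdet.apis import init_detector, inference_detector

config = "particle_detection/logs/fit/20230622-222721/config_file.py"
checkpoint = "particle_detection/logs/fit/20230622-222721/iter_60000.pth"

model = init_detector(config, checkpoint, device="cuda:0")    # or device="cpu"
result = inference_detector(model, "some_haadf_image.png")    # placeholder path to a STEM image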
There is no confidential data in this dataset. It is neither offensive, nor insulting or threatening.
The dataset was generated to discriminate between TiO2 and WO3 nanoparticles in STEM images. It might also be able to discriminate between other materials whose STEM contrast is similar to that of TiO2 and WO3, but there is no guarantee.