ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
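As a minimal illustration of this split (assuming scikit-learn and its bundled iris data, which are not part of the original text):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out 25% of the examples as a test set and evaluate generalization on it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))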
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives predicted correctly), and F1 score (a combination of precision and recall).
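A minimal sketch of computing these four metrics with scikit-learn, using small made-up label vectors (1 = spam, 0 = not spam):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))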
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
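A minimal sketch of two of these techniques chained together (PCA for dimensionality reduction, then k-means for clustering), again assuming scikit-learn and its iris data:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # labels are ignored in the unsupervised setting
X_2d = PCA(n_components=2).fit_transform(X)  # project to 2 principal components
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(clusters[:10])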
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
Water quality replicate sample data and field blank data was collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral during 2017 and 2018. Instantaneous discharge data was collected at the Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019. Instantaneous discharge readings were recorded at a fixed interval of 5 minutes. Mean daily discharge data was collected at the Colorado River above Imperial Dam, Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019. Instantaneous discharge and mean daily discharge data was provided to the USGS by the International Boundary and Water Commission (IBWC). Discrete water-quality samples were collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral during 2017 and 2018 and through March 2019, and values were used to compute dissolved solids concentrations using BOR's method.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Primary author details
John van Osta
ORCID: 0000-0001-6196-1241
Institution: Griffith University and E2M Pty Ltd, Queensland, Australia
Email: john.vanosta@griffithuni.edu.au
Researchers and practitioners applying and adapting the data and code provided here are encouraged to contact the primary author should they require further information.
Sharing/Access information: Licence CC BY 4.0. You are free to share and adapt the material provided that appropriate attribution is given to the authors.
Data and File Overview
This repository provides the code base, supplementary results and example audio data to reproduce the findings of the research article: 'An active learning framework and assessment of inter-annotator agreement facilitate automated recogniser development for vocalisations of a rare species, the southern black-throated finch (Poephila cincta cincta)', published in the Journal of Ecological Informatics. Data included within this repository are listed below.
Code base
The code base includes:
- train_resnet.ipynb: Trains a resnet34 model on target and non-target audio segments (each 1.8 seconds in duration). Outputs a trained model (as a .pth file).
- predict.ipynb: Applies the trained model to unlabelled data.
- BTF_detector_v1.5: The latest version of the model, termed the 'final model' in the research article.
- audio_file_extract.ipynb: Extracts audio frames in accordance with the active learning function, for the purpose of manual review and inclusion in the next iteration of model training.
- stratified_subsample.ipynb: Used to subsample predictions on unlabelled data, stratified across the model prediction confidence scores (aka logits).
- macro_averaged_error.ipynb: Calculates and plots the macro-averaged error of the model predictions against annotator labels.
- inter_annotator_agreement.ipynb: Calculates and plots Krippendorff's alpha (a measure of inter-annotator agreement) among the model's active learning iterations and human annotators.
- requirements.txt: Python package requirements to run the code base.
Note: The code base has been written in Jupyter Notebooks and tested in Python version 3.6.9
Supplementary files
The file Stratified_subsample_inter_annotator_agreement.xlsx contains predictions from each model iteration and annotator labels for each of the 12,278 audio frames included in the model evaluation process, as described in the research article.
Example audio data
Example audio data provided include:
- Target audio files (containing black-throated finch (BTF) calls) and non-target audio files (containing other environmental noises). These are split into Training and Validation sets. To follow an active learning process, each active learning 'iteration' gets added to a new folder (i.e. IT_1, IT_2, etc.).
- Field recordings (10 minutes each), the majority of which contain BTF calls. These audio data were collected from a field site within the Desert Uplands Bioregion of Queensland, Australia, as described and mapped in the research article. Audio data were collected using two devices, Audiomoths and Bioacoustic Recorders (Frontier Labs), which have been separated into different folders within 'Field_recordings'.
Steps to reproduce
General recommendations
The code base has been written in Jupyter Notebooks and tested in Python version 3.6.9.
1. Download the .zip file and extract it to a folder on your machine.
2. Open a code editor that is suitable for working with Jupyter Notebook files. We recommend Microsoft's free software, Visual Studio Code (https://code.visualstudio.com/). If using Visual Studio Code, ensure the 'Python' and 'Jupyter' extensions are installed (https://code.visualstudio.com/docs/datascience/jupyter-notebooks).
3. Within the code editor, open the downloaded file.
4. Set up the Python environment by installing the package requirements listed in the requirements.txt file contained within the repository. The steps to set up a Python environment in Visual Studio Code are described here: https://code.visualstudio.com/docs/python/environments, or more generally for Python here: https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/. This will download the necessary Python packages to support the code below.
Note: We recommend running the following steps on a Windows computer with an Nvidia graphics processing unit (GPU). The code has also been tested on a Windows computer with an Intel central processing unit (CPU), with a substantially slower runtime. Edits to the code may be required to run on a Macintosh computer or a non-Nvidia GPU; however, the core functionality will remain the same.
Active learning iterations to develop the final model:
1. Run train_resnet.ipynb to train a model from the initial target (BTF) and non-target (other environmental sounds) audio provided. The default name for the output model will be 'model.pth'; however, this may be adjusted manually by changing the 'MODEL_NAME' variable. The script also provides performance metrics and a confusion matrix against the validation dataset.
2. Run predict.ipynb to make predictions on unlabelled data. The default code uses the final model (BTF_trained_model_v1.5.pth), as described in the research article; however, this may be adjusted to link to the model created in step 4 (by changing the 'model_path' variable). Results of this step are saved in the Sample_files\Predict_results folder.
3. Run audio_file_extract.ipynb to extract 1.8 second audio snips that have a 'BTF' confidence score of >= 0.5. These are the sounds that range from most uncertain to the model to most likely to be BTF. The logic for this cutoff is discussed in the research article's methods section. The default extraction location is 'Sample_files\Predict_results\Audio_frames_for_review'. (A minimal sketch of this selection step is shown after these steps.)
4. Manually review extracted audio frames and move them to the appropriate folder of the training data. E.g., for audio frames that are reviewed to contain:
- BTF calls, move them to the filepath 'Sample_files\Training_clips\Train\BTF\IT_2'
- Not BTF calls, move them to the filepath 'Sample_files\Training_clips\Train\Not BTF\IT_2'
IT_2 represents the second active learning iteration. Ensure 30% of the files are allocated to the validation set ('Sample_files\Training_clips\Val'). Note that users will need to create subfolders for each successive iteration.
5. Repeat steps 1 to 4, making sure to update the 'iterations' variable in the train_resnet.ipynb code to include all active learning iterations undertaken. For example, to include iterations 1 and 2 in the model, set the variable 'iterations' to equal ['IT_1', 'IT_2']. An example is provided in the train_resnet.ipynb code.
6. Stop the active learning process when the stopping criterion is reached (e.g. when the F1 score plateaus).
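As referenced in step 3, the following is a minimal, hypothetical sketch of that selection logic, keeping only 1.8 second frames with a 'BTF' confidence score of at least 0.5 for manual review. The column names and CSV layout are assumptions; audio_file_extract.ipynb remains the authoritative implementation.

import shutil
from pathlib import Path
import pandas as pd

predictions = pd.read_csv("Sample_files/Predict_results/predictions.csv")  # assumed file name
for_review = predictions[predictions["BTF_confidence"] >= 0.5]             # assumed column name
out_dir = Path("Sample_files/Predict_results/Audio_frames_for_review")
out_dir.mkdir(parents=True, exist_ok=True)
for frame_path in for_review["frame_path"]:                                # assumed column name
    shutil.copy(frame_path, out_dir / Path(frame_path).name)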
Model evaluation steps
1. Run predict.ipynb using the final model on an unlabelled test dataset. By default, the unlabelled audio data used is the example data saved at 'Sample_files\Field_recordings\Audiomoth'. However, this should be changed to data not used to train the model, such as 'Sample_files\Field_recordings\BAR', or your own audio data.
2. Run stratified_subsample.ipynb to subsample the predictions that the final model made on the unlabelled data. A stratified subsample approach is used, whereby samples are stratified across confidence scores, as described in the research article. The default output file is 'stratified_subsample_predictions.csv'. (A minimal sketch of this subsampling idea is shown after these steps.)
3. We then manually reviewed the subsamples, including a cross review by experts on the species, as detailed in the research article. We have provided the results of our model evaluation: 'Study_results\Stratified_subsample_inter_annotator_agreement.xlsx'.
4. Run macro_averaged_error.ipynb and inter_annotator_agreement.ipynb to reproduce the results and plots contained within the paper.
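As referenced in step 2, here is a minimal sketch of the stratified-subsampling idea, assuming a pandas DataFrame of predictions with a 'BTF_confidence' column; stratified_subsample.ipynb is the repository's actual implementation.

import numpy as np
import pandas as pd

def stratified_subsample(preds, n_bins=10, per_bin=50, score_col="BTF_confidence", seed=0):
    # Bin predictions by confidence score, then draw an equal-sized sample from each bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    preds = preds.assign(score_bin=pd.cut(preds[score_col], bins, include_lowest=True))
    return (preds.groupby("score_bin", observed=True, group_keys=False)
                 .apply(lambda g: g.sample(min(len(g), per_bin), random_state=seed)))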
Using the model on your own data
The predict.ipynb code may be adapted to run the BTF call detection model on data outside of this repository.
Notes for running on your own data:
- Accepts wav or flac files
- Accepts files from Audiomoth devices, using the file naming format: 'AM###_YYYYMMDD_HHMMSS'
- Accepts files from Bioacoustic Recorder devices (Frontier Labs), using the file naming format: 'BAR##_YYYYMMDDTHHMMSS+TZ_REC'
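A minimal sketch of parsing recording start times from the two naming formats above (assuming four-digit years and ignoring the timezone suffix in the Frontier Labs format; the example file names are hypothetical):

import re
from datetime import datetime

def recording_start(filename):
    m = re.match(r"AM\d+_(\d{8})_(\d{6})", filename)        # Audiomoth: AM###_YYYYMMDD_HHMMSS
    if not m:
        m = re.match(r"BAR\d+_(\d{8})T(\d{6})", filename)   # Frontier Labs: BAR##_YYYYMMDDTHHMMSS+TZ_REC
    if not m:
        raise ValueError(f"Unrecognised file name format: {filename}")
    return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")

print(recording_start("AM001_20200915_063000.wav"))
print(recording_start("BAR01_20200915T063000+1000_REC.flac"))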
Dataset Source: https://www.aicrowd.com/challenges/data-purchasing-challenge-2022
🕵️ Introduction
Data for machine learning tasks usually does not come for free but has to be purchased. The costs and benefits of data have to be weighed against each other. This is challenging. First, data usually has combinatorial value. For instance, different observations might complement or substitute each other for a given machine learning task. In such cases, the decision to purchase one group of observations has to be made conditional on the decision to purchase another group of observations. If these relationships are high-dimensional, finding the optimal bundle becomes computationally hard. Second, data comes in different quality, for instance, with different levels of noise. Third, data has to be acquired under the assumption of being valuable out-of-sample. Distribution shifts have to be anticipated.
In this competition, you face these data purchasing challenges in the context of a multi-label image classification task in a quality control setting.
📑 Problem Statement
In short: You have to classify images. Some images in your training set are labelled but most of them aren't. How do you decide which images to label if you have a limited budget to do so?
In more detail: You face a multi-label image classification task. The dataset consists of synthetically generated images of painted metal sheets. A classifier is meant to predict whether the sheets have production damages and if so which ones. You have access to a set of images, a subset of which are labelled with respect to production damages. Because labeling is costly and your budget is limited, you have to decide for which of the unlabelled images labels should be purchased in order to maximize prediction accuracy.
Each image has a 4-dimensional label representing the presence or absence of ['scratch_small', 'scratch_large', 'dent_small', 'dent_large'] in the image.
You are required to submit code, which can be run in three different phases:
Pre-Training Phase
In the Pre-Training Phase, your code will have access to 5,000 labelled images for a multi-label image classification task with 4 classes. It is up to you how you wish to use this data. For instance, you might want to pre-train a classification model.
Purchase Phase
In the Purchase Phase, your code, after going through the Pre-Training Phase, will have access to an unlabelled dataset of 10,000 images. You will have a budget of 3,000 label purchases, which you can freely use across any of the images in the unlabelled dataset to obtain their labels. You are tasked with designing your own approach to selecting the optimal subset of 3,000 images from the unlabelled dataset that will best optimize your model's performance on the prediction task. You can then continue training your model (which has been pre-trained in the Pre-Training Phase) using the newly purchased labels.
Prediction Phase
In the Prediction Phase, your code will have access to a test set of 3,000 unlabelled images, for which you have to generate and submit predictions. Your submission will be evaluated based on the performance of your predictions on this test set. Your code will have access to a node with 4 CPUs, 16 GB RAM, 1 NVIDIA T4 GPU, and 3 hours of runtime per submission. In the final round of this challenge, your code will be evaluated across multiple budget-runtime constraints.
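As one hypothetical illustration of a Purchase Phase strategy (not an official baseline), a model pre-trained in the first phase could score every unlabelled image and spend the 3,000-label budget on the most uncertain ones. The probability array and the purchase API signature below are assumptions.

import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_purchases(probs, budget=3000):
    # probs: array of shape (n_unlabelled, 4) holding sigmoid outputs for the 4 damage labels.
    uncertainty = binary_entropy(probs).mean(axis=1)
    return np.argsort(-uncertainty)[:budget]   # indices of the most uncertain images

# labels = {idx: purchase_label(idx) for idx in select_purchases(probs)}  # signature assumed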
💾 Dataset
The datasets for this challenge can be accessed in the Resources Section.
- training.tar.gz: The training set containing 5,000 images with their associated labels. During your local experiments you are allowed to use the data as you please.
- unlabelled.tar.gz: The unlabelled set containing 10,000 images and their associated labels. During your local experiments you are only allowed to access the labels through the provided purchase_label function.
- validation.tar.gz: The validation set containing 3,000 images and their associated labels. During your local experiments you are only allowed to use the labels of the validation set to measure the performance of your models and experiments.
- debug.tar.gz: A small set of 100 images with their associated labels that you can use for integration testing and for trying out the provided starter kit.
NOTE: While you run your local experiments on this dataset, your submissions will be evaluated on a dataset which might be sampled from a different distribution and is not the same as this publicly released version.
👥 Participation
🖊 Evaluation Criteria
The challenge will use the Accuracy Score, Hamming Loss and the Exact Match Ratio during evaluation. The primary score will be the Accuracy Score.
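A minimal sketch of these three metrics under common multi-label conventions for the 4-label task (the organisers' exact definitions may differ slightly):

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true = np.array([[1, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]])   # made-up ground truth
y_pred = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [1, 1, 0, 1]])   # made-up predictions
print("element-wise accuracy:", (y_true == y_pred).mean())       # per-label accuracy
print("hamming loss:", hamming_loss(y_true, y_pred))             # 1 - element-wise accuracy
print("exact match ratio:", accuracy_score(y_true, y_pred))      # all 4 labels correct per image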
📅 Timeline
This challenge has two Rounds.
Round 1 : Feb 4th – Feb 28th, 2022
The first round submissions will be evaluated based on one budget-compute constraint pair (max. of 3,00...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"
DOI: https://doi.org/10.11583/DTU.c.6287841
Using Deep Learning for detection of maritime abnormal behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour.
The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.
We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories cover, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from the normal course.
These datasets consist of unlabelled trajectories for the purpose of training unsupervised models. For labelled datasets for evaluation, please refer to the collection; link in Related publications.
The data is saved using the pickle format for Python. Each dataset is split into two files with the naming convention:
datasetInfo_XXX
data_XXX
Files named "data_XXX" contain the extracted trajectories serialized sequentially one at a time and must be read as such. Please refer to the provided utility functions for examples. Files named "datasetInfo_XXX" contain metadata related to the dataset and the indices at which trajectories begin in the "data_XXX" files.
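For illustration only (the provided utility functions are the recommended way to read the data), sequentially pickled objects can be read back like this:

import pickle

def read_all(path):
    # Objects were pickled one at a time into the same file, so load until EOF.
    items = []
    with open(path, "rb") as f:
        while True:
            try:
                items.append(pickle.load(f))
            except EOFError:
                break
    return items

trajectories = read_all("data_XXX")  # placeholder file name following the naming convention above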
The data are sequences of maritime trajectories defined by their timestamp, latitude/longitude position, speed, course, and unique ship identifier (MMSI). In addition, the dataset contains metadata related to creation parameters. The dataset has been limited to a specific time period, ship types, and moving AIS navigational statuses, and filtered within a region of interest (ROI). Trajectories were split if exceeding an upper limit, and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.
Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl
See the datasheet for more detailed information, and refer to the provided utility functions for examples of how to read and plot the data.
A dataset for unsupervised person re-identification using Generative Adversarial Networks (GANs).
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1XXDMW
Identifying important policy outputs has long been of interest to political scientists. In this work, we propose a novel approach to the classification of policies. Instead of obtaining and aggregating expert evaluations of significance for a finite set of policy outputs, we use experts to identify a small set of significant outputs and then employ positive unlabeled (PU) learning to search for other similar examples in a large unlabeled set. We further propose to automate the first step by harvesting ‘seed’ sets of significant outputs from web data. We offer an application of the new approach by classifying over 9,000 government regulations in the United Kingdom. The obtained estimates are successfully validated against human experts, by forecasting web citations, and with a construct validity test.
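For readers unfamiliar with PU learning, the following is a generic two-step heuristic sketched with scikit-learn; it illustrates the general idea of searching an unlabeled pool from a positive seed set, not the authors' exact estimator, and the random features are placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_two_step(X_pos, X_unlabeled, neg_quantile=0.2):
    # Step 1: treat all unlabeled examples as provisional negatives and fit a classifier.
    X = np.vstack([X_pos, X_unlabeled])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unlabeled))]
    step1 = LogisticRegression(max_iter=1000).fit(X, y)
    # Step 2: keep the unlabeled examples scored most confidently negative as "reliable negatives",
    # then refit and rank the unlabeled pool by predicted probability of being a positive.
    scores = step1.predict_proba(X_unlabeled)[:, 1]
    reliable_neg = X_unlabeled[scores <= np.quantile(scores, neg_quantile)]
    X2 = np.vstack([X_pos, reliable_neg])
    y2 = np.r_[np.ones(len(X_pos)), np.zeros(len(reliable_neg))]
    return LogisticRegression(max_iter=1000).fit(X2, y2).predict_proba(X_unlabeled)[:, 1]

# Toy usage: 50 known "significant" outputs vs. 500 unlabeled ones, 10 random features each.
rng = np.random.default_rng(0)
ranking = pu_two_step(rng.normal(1.0, 1.0, (50, 10)), rng.normal(0.0, 1.0, (500, 10)))
print(ranking[:5])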
STL-10 is an image recognition dataset inspired by the CIFAR-10 dataset, with some improvements. With a corpus of 100,000 unlabeled images and 500 training images per class, this dataset is well suited for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Unlike CIFAR-10, the dataset has a higher resolution, which makes it a challenging benchmark for developing more scalable unsupervised learning methods.
Data overview:
The original data source recommends the following standardized testing protocol for reporting results:
Original data source and banner image: https://cs.stanford.edu/~acoates/stl10/
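If you work in Python, one convenient way to load the dataset is through torchvision's packaging of STL-10; this is a hedged sketch, the root path is a placeholder, and the first call downloads the archive from the official source.

import torchvision
from torchvision import transforms

root = "./stl10"  # placeholder local directory
to_tensor = transforms.ToTensor()
unlabeled = torchvision.datasets.STL10(root, split="unlabeled", transform=to_tensor, download=True)
train = torchvision.datasets.STL10(root, split="train", transform=to_tensor, download=True)
print(len(unlabeled), len(train))  # 100,000 unlabeled images; 5,000 labeled training images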
Please cite the following reference when using this dataset:
Adam Coates, Honglak Lee, and Andrew Y. Ng. An Analysis of Single Layer Networks in Unsupervised Feature Learning. AISTATS, 2011.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average execution time over all data sets, per unlabelled sample, for the first two iterations of the AL process for each AL strategy, with their standard deviations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reaching the performance of fully supervised learning with unlabeled data and only labeling one sample per class might be ideal for deep learning applications. We demonstrate for the first time the potential for building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN, attaining test accuracies that are comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential, and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class balancing methods substantially improve accuracy results in semi-supervised learning to levels that allow self-training to reach the level of fully supervised learning performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class, rather than only a total test accuracy. We show that our BOSS methodology can obtain total test accuracies with CIFAR-10 images and only one labeled sample per class up to 95% (compared to 94.5% for fully supervised). Similarly, the SVHN images obtain test accuracies of 97.8%, compared to 98.27% for fully supervised. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at https://github.com/lnsmith54/BOSS to facilitate replication.
These datasets were used while writing the following work:
Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
Please cite us if you use our datasets in your academic work:
@inproceedings{polo2021predicting,
title={Predicting legal proceedings status: approaches based on sequential text data},
author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
pages={264--265},
year={2021}
}
More details below!
Every legal proceeding in Brazil falls into one of three possible status classes: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. The status refers to a specific instant in time and may be temporary or permanent. Moreover, statuses are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.
In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.
Our data is composed of two datasets: a dataset of ~3*10^6 unlabeled motions and a dataset containing 6449 legal proceedings, each with an individual and a variable number of motions, but which have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% is classified as active (class 2), and 7.63% is classified as suspended (class 3).
The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most variable types of cases throughout Brazil and are responsible for 80% of the total amount of lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.
Regarding the labeled dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.
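As a purely illustrative sketch of working with that key convention (the motion texts below are invented, and the exact on-disk layout should be checked against the released files), the motions of one proceeding can be put into chronological order like this:

def motions_in_chronological_order(proceeding):
    # "-1" is the most recent motion, "-2" the one before it, and so on,
    # so sorting the keys as integers in ascending order yields oldest -> newest.
    return [proceeding[k] for k in sorted(proceeding, key=int)]

example = {"-1": "most recent motion text", "-2": "earlier motion text", "-3": "oldest motion text"}
print(motions_in_chronological_order(example))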
We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.
Can you develop good machine learning classifiers for text sequences? :)
The ability to conduct cost-effective wildlife monitoring at scale is rapidly increasing due to the availability of inexpensive autonomous recording units (ARUs) and automated species recognition, presenting a variety of advantages over human-based surveys. However, estimating abundance with such data collection techniques remains challenging because most abundance models require data that are difficult for low-cost monoaural ARUs to gather (e.g., counts of individuals, distance to individuals), especially when using the output of automated species recognition. Statistical models that do not require counting or measuring distances to target individuals, in combination with low-cost ARUs, provide a promising way of obtaining abundance estimates for large-scale wildlife monitoring projects but remain untested. We present a case study using avian field data collected in forests of Pennsylvania during the spring of 2020 and 2021 using both traditional point counts and passive acoustic monitoring at the same locations. We tested the ability of the Royle-Nichols and time-to-detection models to estimate abundance of two species from detection histories generated by applying a machine-learning classifier to ARU-gathered data. We compared abundance estimates from these models to estimates from the same models fit using point-count data and to two additional models appropriate for point counts, the N-mixture model and distance models. We found that the Royle-Nichols and time-to-detection models can be used with ARU data to produce abundance estimates similar to those generated by a point-count based study but with greater precision. ARU-based models produced confidence or credible intervals that were on average 31.9% (±11.9 SE) smaller than their point-count counterparts. Our findings were consistent across two species with differing relative abundance and habitat use patterns. The higher precision of models fit using ARU data is likely due to higher cumulative detection probability, which itself may be the result of greater survey effort using ARUs and machine-learning classifiers to sample significantly more time for focal species at any given point. Our results provide preliminary support for the use of ARUs in abundance-based study applications, and thus may afford researchers a better understanding of habitat quality and population trends, while allowing them to make more informed conservation actions and recommendations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository contains the OCT images and binary annotations for segmentation of retinal tissue using deep learning. To use, please refer to the Github repository https://github.com/theislab/DeepRT.
#######
Access to large, annotated samples represents a considerable challenge for training accurate deep-learning models in medical imaging. While current leading-edge transfer learning from pre-trained models can help with cases lacking data, it limits design choices, and generally results in the use of unnecessarily large models. We propose a novel, self-supervised training scheme for obtaining high-quality, pre-trained networks from unlabeled, cross-modal medical imaging data, which will allow for creating accurate and efficient models. We demonstrate this by accurately predicting optical coherence tomography (OCT)-based retinal thickness measurements from simple infrared (IR) fundus images. Subsequently, learned representations outperformed advanced classifiers on a separate diabetic retinopathy classification task in a scenario of scarce training data. Our cross-modal, three-staged scheme effectively replaced 26,343 diabetic retinopathy annotations with 1,009 semantic segmentations on OCT and reached the same classification accuracy using only 25% of fundus images, without any drawbacks, since OCT is not required for predictions. We expect this concept will also apply to other multimodal clinical data (imaging, health records, and genomics data) and be applicable to corresponding sample-starved learning problems.
#######
This dataset provides the expected and determined concentrations of selected inorganic and organic analytes for spiked reagent-water samples (calibration standards and limit of quantitation standards) that were used to calculate detection limits by using the United States Environmental Protection Agency’s (USEPA) Method Detection Limit (MDL) version 1.11 or 2.0 procedures, ASTM International’s Within-Laboratory Critical Level standard procedure D7783-13, and, for five pharmaceutical compounds, by USEPA’s Lowest Concentration Minimum Reporting Level procedure. Also provided are determined concentration data for reagent-water laboratory blank samples, classified as either instrument blank or set blank samples, and reagent-water blind-blank samples submitted by the USGS Quality System Branch, that were used to calculate blank-based detection limits by using the USEPA MDL version 2.0 procedure or procedures described in National Water Quality Laboratory Technical Memorandum 2016.02, http://wwwnwql.cr.usgs.gov/tech_memos/nwql.2016-02.pdf. The determined detection limits are provided and compared in the related external publication at https://doi.org/10.1016/j.talanta.2021.122139.
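For orientation, the spike-based calculation at the core of the USEPA MDL procedure is a Student's-t multiple of the replicate standard deviation. The sketch below uses invented replicate results; the cited MDL 2.0, ASTM D7783-13, and LCMRL procedures involve additional steps not shown here.

import numpy as np
from scipy import stats

replicates = np.array([0.021, 0.025, 0.019, 0.023, 0.026, 0.020, 0.022])  # hypothetical spiked results, ug/L
n = len(replicates)
s = replicates.std(ddof=1)             # sample standard deviation of the replicate determinations
mdl = stats.t.ppf(0.99, df=n - 1) * s  # one-sided 99th-percentile t-value times s
print(round(mdl, 4), "ug/L")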
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Point Blank population distribution across 18 age groups. It lists the population in each age group along with the percentage of the total population for Point Blank. The dataset can be utilized to understand the population distribution of Point Blank by age. For example, using this dataset, we can identify the largest age group in Point Blank.
Key observations
The largest age group in Point Blank, TX was 60 to 64 years, with a population of 106 (13.04%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Point Blank, TX was 10 to 14 years, with a population of 6 (0.74%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you do need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Point Blank Population by Age. You can refer to the same here.
Metric and attribute data for blank sample (LU2 and LU3).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf
In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:
If you find this dataset useful for your research, please cite:
@article{ji2022amos,
title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
journal={arXiv preprint arXiv:2206.08023},
year={2022}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Conventional paper currency and modern electronic currency are two important modes of transactions. In several parts of the world, conventional methodology has clear precedence over its electronic counterpart. However, the identification of forged currency paper notes is now becoming an increasingly crucial problem because of the new and improved tactics employed by counterfeiters. In this paper, a machine-assisted system, dubbed DeepMoney, is proposed which has been developed to discriminate fake notes from genuine ones. For this purpose, state-of-the-art models of machine learning called Generative Adversarial Networks (GANs) are employed. GANs use unsupervised learning to train a model that can then be used to perform supervised predictions. This flexibility provides the best of both worlds by allowing unlabelled data to be trained on whilst still making concrete predictions. This technique was applied to Pakistani banknotes. State-of-the-art image processing and feature recognition techniques were used to design the overall approach of a valid input. Augmented samples of images were used in the experiments, which show that a high-precision machine can be developed to recognize genuine paper money. An accuracy of 80% has been achieved. The code is available as open source to allow others to reproduce and build upon the efforts already made.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than the laborious experimental measurements when encountering a daunting number of polymer structures. Various ML models are demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, using different structure representations, and based on different feature engineering methods. Thus, the critical question arises of selecting a proper ML model to better handle the Tg prediction with generalization ability. To provide a fair comparison of different ML techniques and examine the key factors that affect the model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are structure representations, feature representations, and ML algorithms. In terms of polymer structure representation, we consider the polymer monomer, repeat unit, and oligomer with longer chain structure. Based on that, the feature representation is calculated, including Morgan fingerprinting with or without substructure frequency, RDKit descriptors, molecular embedding, molecular graph, etc. Afterward, the obtained feature input is used to train different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. The ML model's generalization ability on an unlabeled data set is a particular focus, and the model's sensitivity to topology and the molecular weight of polymers is also taken into consideration. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.
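As a hedged sketch of one such pipeline (Morgan fingerprints on a repeat-unit SMILES plus a random-forest regressor), with placeholder SMILES strings and Tg values rather than the paper's actual data or hyperparameters:

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def morgan_fp(smiles, radius=2, n_bits=2048):
    # Convert a SMILES string into a fixed-length Morgan fingerprint bit vector.
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

smiles_list = ["CC(C)C(=O)OC", "C=Cc1ccccc1"]   # placeholder repeat-unit-like SMILES
tg_kelvin = np.array([378.0, 373.0])            # placeholder Tg labels in K

X = np.vstack([morgan_fp(s) for s in smiles_list])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, tg_kelvin)
print(model.predict(X))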
The prediction of response to drugs before initiating therapy based on transcriptome data is a major challenge. However, identifying effective drug response label data costs time and resources. Methods available often predict poorly and fail to identify robust biomarkers due to the curse of dimensionality: high dimensionality and low sample size. Therefore, this necessitates the development of predictive models to effectively predict the response to drugs using limited labeled data while being interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework to predict the drug response using transcriptome data of few labeled data and additional unlabeled data. HiRAND completes the information integration of the gene graph and sample graph by graph convolutional network (GCN). The innovation of our model is leveraging data augmentation strategy to solve the dilemma of limited labeled data and using consistency regularization to optimize the prediction consistency of unlabeled data across different data augmentations. The results showed that HiRAND achieved better performance than competitive methods in various prediction scenarios, including both simulation data and multiple drug response data. We found that the prediction ability of HiRAND in the drug vorinostat showed the best results across all 62 drugs. In addition, HiRAND was interpreted to identify the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. Our HiRAND could be utilized as an efficient framework for improving the drug response prediction performance using few labeled data.
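For readers unfamiliar with consistency regularization, here is a generic, toy illustration in PyTorch (not HiRAND's graph-based implementation): predictions for two random augmentations of the same unlabeled batch are pushed to agree, and this penalty would be added to the supervised loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))  # toy 2-class classifier

def augment(x):
    return x + 0.05 * torch.randn_like(x)   # stand-in data augmentation: small feature noise

x_unlabeled = torch.randn(32, 100)          # toy batch of unlabeled expression profiles
p1 = F.softmax(model(augment(x_unlabeled)), dim=1)
p2 = F.softmax(model(augment(x_unlabeled)), dim=1)
consistency = F.mse_loss(p1, p2)            # add to the labeled cross-entropy loss with a weight
print(consistency.item())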