ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
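For example, a minimal sketch of such a split using scikit-learn (the bundled Iris data stands in for any labeled dataset):

# Minimal sketch: splitting a labeled dataset into training and test sets with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the examples to evaluate generalization on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)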
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall).
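A minimal sketch of computing these metrics with scikit-learn, using small hard-coded label vectors purely for illustration:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))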
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
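A minimal sketch of two of these techniques, k-means clustering and PCA, applied to synthetic unlabeled data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # unlabeled data: 200 examples, 10 features

# Group the examples into 3 clusters and project them to 2 dimensions for inspection.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)

print(clusters[:10])
print(X_2d.shape)  # (200, 2)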
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I created this dataset as part of a data analysis project and concluded that it might be relevant for others who are interested in analyzing content on YouTube. This dataset is a collection of over 6,000 videos with the following columns:
Comments: comments count for the video
Through the YouTube API and using Python, I collected data about videos from popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years (a minimal sketch of this kind of API call is shown after the channel list below). Featured in the dataset are the following creators:
Krish Naik
Nicholas Renotte
Sentdex
DeepLearningAI
Artificial Intelligence — All in One
Siraj Raval
Jeremy Howard
Applied AI Course
Daniel Bourke
Jeff Heaton
DeepLearning.TV
Arxiv Insights
These channels are featured in multiple "top AI channels to subscribe to" lists and have seen significant growth on YouTube in the last couple of years. All of them were created in or before 2018.
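The original collection script is not part of this description, but a hypothetical sketch of the kind of YouTube Data API call involved (API_KEY and CHANNEL_ID are placeholders) could look like this:

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"   # placeholder
CHANNEL_ID = "UC..."       # placeholder channel id, e.g. one of the creators listed above

youtube = build("youtube", "v3", developerKey=API_KEY)

# Find recent videos uploaded by the channel.
search = youtube.search().list(
    part="id", channelId=CHANNEL_ID, type="video", order="date", maxResults=50
).execute()
video_ids = [item["id"]["videoId"] for item in search["items"]]

# Fetch per-video statistics, including the comment count column described above.
stats = youtube.videos().list(part="snippet,statistics", id=",".join(video_ids)).execute()
for item in stats["items"]:
    print(item["snippet"]["title"], item["statistics"].get("commentCount"))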
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains samples 17 - 24 from the data collection described in
Henri Der Sarkissian, Felix Lucka, Maureen van Eijnatten, Giulia Colacicco, Sophia Bethany Coban, Kees Joost Batenburg, "A Cone-Beam X-Ray CT Data Collection Designed for Machine Learning", Sci Data 6, 215 (2019). https://doi.org/10.1038/s41597-019-0235-y or arXiv:1905.04787 (2019)
Abstract:
"Unlike previous works, this open data collection consists of X-ray cone-beam (CB) computed tomography (CT) datasets specifically designed for machine learning applications and high cone-angle artefact reduction: Forty-two walnuts were scanned with a laboratory X-ray setup to provide not only data from a single object but from a class of objects with natural variability. For each walnut, CB projections on three different orbits were acquired to provide CB data with different cone angles as well as being able to compute artefact-free, high-quality ground truth images from the combined data that can be used for supervised learning. We provide the complete image reconstruction pipeline: raw projection data, a description of the scanning geometry, pre-processing and reconstruction scripts using open software, and the reconstructed volumes. Due to this, the dataset can not only be used for high cone-angle artefact reduction but also for algorithm development and evaluation for other tasks, such as image reconstruction from limited or sparse-angle (low-dose) scanning, super resolution, or segmentation."
The scans were performed using a custom-built, highly flexible X-ray CT scanner, the FleX-ray scanner, developed by XRE nv and located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, the Netherlands. The general purpose of the FleX-ray Lab is to conduct proof-of-concept experiments directly accessible to researchers in the field of mathematics and computer science. The scanner consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1536-by-1944 pixel, 14-bit flat panel detector (Dexella 1512NDT), and a rotation stage in between, upon which a sample is mounted. All three components are mounted on translation stages, which allow them to move independently from one another.
Please refer to the paper for all further technical details.
The complete data set can be found via the following links: 1-8, 9-16, 17-24, 25-32, 33-37, 38-42
The corresponding Python scripts for loading, pre-processing, and reconstructing the projection data in the way described in the paper can be found on GitHub.
For more information or guidance in using this dataset, please get in touch with the authors.
Our People data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.
Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.
People Data Schema & Reach: Our data reach represents the total counts available within various categories and comprises attributes such as country location, MAU, DAU, and monthly location pings.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).
People data Use Cases:
360-Degree Customer View: Get a comprehensive image of customers by means of internal and external data aggregation.
Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting using user data enrichment.
Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.
Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.
Here's the schema of People Data:
person_id
first_name
last_name
age
gender
linkedin_url
twitter_url
facebook_url
city
state
address
zip
zip4
country
delivery_point_bar_code
carrier_route
walk_seuqence_code
fips_state_code
fips_country_code
country_name
latitude
longtiude
address_type
metropolitan_statistical_area
core_based+statistical_area
census_tract
census_block_group
census_block
primary_address
pre_address
streer
post_address
address_suffix
address_secondline
address_abrev
census_median_home_value
home_market_value
property_build+year
property_with_ac
property_with_pool
property_with_water
property_with_sewer
general_home_value
property_fuel_type
year
month
household_id
Census_median_household_income
household_size
marital_status
length+of_residence
number_of_kids
pre_school_kids
single_parents
working_women_in_house_hold
homeowner
children
adults
generations
net_worth
education_level
occupation
education_history
credit_lines
credit_card_user
newly_issued_credit_card_user
credit_range_new
credit_cards
loan_to_value
mortgage_loan2_amount
mortgage_loan_type
mortgage_loan2_type
mortgage_lender_code
mortgage_loan2_render_code
mortgage_lender
mortgage_loan2_lender
mortgage_loan2_ratetype
mortgage_rate
mortgage_loan2_rate
donor
investor
interest
buyer
hobby
personal_email
work_email
devices
phone
employee_title
employee_department
employee_job_function
skills
recent_job_change
company_id
company_name
company_description
technologies_used
office_address
office_city
office_country
office_state
office_zip5
office_zip4
office_carrier_route
office_latitude
office_longitude
office_cbsa_code
office_census_block_group
office_census_tract
office_county_code
company_phone
company_credit_score
company_csa_code
company_dpbc
company_franchiseflag
company_facebookurl
company_linkedinurl
company_twitterurl
company_website
company_fortune_rank
company_government_type
company_headquarters_branch
company_home_business
company_industry
company_num_pcs_used
company_num_employees
company_firm_individual
company_msa
company_msa_name
company_naics_code
company_naics_description
company_naics_code2
company_naics_description2
company_sic_code2
company_sic_code2_description
company_sic_code4
company_sic_code4_description
company_sic_code6
company_sic_code6_description
company_sic_code8
company_sic_code8_description
company_parent_company
company_parent_company_location
company_public_private
company_subsidiary_company
company_residential_business_code
company_revenue_at_side_code
company_revenue_range
company_revenue
company_sales_volume
company_small_business
company_stock_ticker
company_year_founded
company_minorityowned
company_female_owned_or_operated
company_franchise_code
company_dma
company_dma_name
company_hq_address
company_hq_city
company_hq_duns
company_hq_state
company_hq_zip5
company_hq_zip4
company_se...
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for four species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
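As a hedged illustration of how the release might be indexed (assuming the files have been unpacked into a local folder, here called nabat_audio, with one sub-folder per four-letter species code as described above):

from pathlib import Path

root = Path("nabat_audio")  # placeholder path to the unpacked data release

records = []
for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    species_code = species_dir.name  # four-letter species code (or the noise class)
    for wav in species_dir.glob("*.wav"):
        records.append({"path": str(wav), "label": species_code})

print(f"{len(records)} labeled recordings across "
      f"{len({r['label'] for r in records})} classes")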
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be that the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: it is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of "distance". In high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all of the information. From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, which in turn affects the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run (which they definitely did), the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering during data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
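A minimal sketch of the kind of comparison discussed above, using synthetic data and scikit-learn rather than the project's actual dataset: a classifier is scored with and without a k-means cluster label appended as an extra feature.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Append the cluster assignment as one additional feature.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])
augmented = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {baseline:.3f}")
print(f"augmented accuracy: {augmented:.3f}")  # often not much better, as noted above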
Two field research projects focused on cotton were conducted at the El Reno, OK and Temple, TX USDA-ARS research stations. The study at the Southern Plains research station took place in 2019, where cotton cultivars included FiberMax 1830 GLT, FiberMax 1888 GL, FiberMax 2498 GLT, Phytogen 300 W3FE, Phytogen 350 W3FE, and Phytogen 490 W3FE. The experiment took place on Norge silt loam soil. The experimental design was a randomized complete block design with three replications and four nitrogen treatments: unfertilized, and cover crop (Lathyrus sativus) with 0, 30, or 60 kg/ha of inorganic nitrogen. The experiment was planted at a seeding rate of 12 seeds/m on May 28, 2019. The study at the Texas Gulf station incorporated four cultivars: DP 1646 B2XF, DP 2012 B3XF, DP 2239 B3XF, and DP 1646 B2XF. The experiment took place on Houston Black clay soils. The experimental design included 6-row strip plots with 0.76 m row spacing and approximately 0.1 m plant spacing at a seeding rate of 13 seeds per meter. The experiment also included a multi-planting-date design with planting on March 28, April 11, April 22, and May 11. These two stations collected in-situ measurements of height, node count, leaf area index (LAI), canopy cover percentage, and chlorophyll content. LAI data were collected using the LAI-2200C Plant Canopy Analyzer (LI-COR, Lincoln, NE, USA). Canopy cover data were collected using the app "Canopeo", which calculates the ratio of plant to ground pixels. Chlorophyll content data were collected using the Chlorophyll Content Meter-300 (Opti-Sciences, Hudson, NH, USA), with measurements in mg/m2. Directly following these measurements, a spectroradiometer (FieldSpec Pro FR; Malvern Panalytical, Westborough, MA, USA) was used to collect hyperspectral data from 350 nm to 2500 nm. The locations of both the in-situ and hyperspectral data collections were chosen at random throughout the field projects, which took place in 2019 (OK) and 2022 (TX); specific collection dates are included in the database.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
What this collection is: A curated, binary-classified image dataset of grayscale (1-band), 400 x 400-pixel images, or chips, in JPEG format, extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and featuring clear open-ocean chips, look-alikes (wind or biogenic features), and oil slick chips.
This binary dataset contains chips labelled as:
- "0" for chips not containing any oil features (look-alikes or clean seas)
- "1" for those containing oil features.
This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.
Why: This dataset can be used for training, validation and/or testing of machine learning, including deep learning, algorithms for the detection of oil features in SAR imagery. Directly applicable for algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1 ), it may be suitable for the development of detection algorithms for other SAR satellite sensors.
Overview of this dataset: Total number of chips (both classes) is N=5,630.
Class 0: 3,725 chips
Class 1: 1,905 chips
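As a hedged sketch (assuming the chips are extracted into class sub-folders named 0 and 1; the actual layout is documented in the ReadMe file referenced below), a TensorFlow input pipeline for these grayscale 400 x 400 chips could look like:

import tensorflow as tf

dataset = tf.keras.utils.image_dataset_from_directory(
    "sar_chips",             # placeholder path to the extracted chips
    labels="inferred",
    label_mode="binary",
    color_mode="grayscale",  # 1-band chips
    image_size=(400, 400),
    batch_size=32,
)

# Because the classes are imbalanced (66% vs 34%), a class weight can be passed when
# fitting a model, e.g. model.fit(dataset, class_weight={0: 1.0, 1: 3725 / 1905}).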
Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for training a machine learning model to classify tennis serves into "In", "Out", and "Net", as well as to predict the outcome coordinates of the serve. A marker-based motion capture system provided and operated by The University of Bath's Applied BioMechanics Suite was used to collect spatio-temporal data on participants completing tennis serves, whilst a high-speed video camera recorded the outcome. The dataset has two parts: 1) the spatio-temporal data used for training and validation, and 2) the serve outcome and coordinates for labelling the data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main aspects:
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
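A minimal sketch of this key-level diff on a config.json-style file (the inputs below are illustrative, not taken from the dataset):

import json

def diff_config(old_text: str, new_text: str):
    """Return the added, deleted, and updated top-level keys between two JSON files."""
    old, new = json.loads(old_text), json.loads(new_text)
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    updated = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, deleted, updated

before = '{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}'
after = '{"hidden_size": 1024, "num_layers": 12, "vocab_size": 50257}'
print(diff_config(before, after))  # (['vocab_size'], ['dropout'], ['hidden_size'])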
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
The replication package is organized as follows:
- code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
HFCommitsPreprocessing.ipynb: Processes commit data, including:
HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
RQ1_Analysis.ipynb: Analysis for RQ1.
RQ2_Analysis.ipynb: Analysis for RQ2.
RQ3_Analysis.ipynb: Analysis for RQ3.
- datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.
HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
- metadata/: Contains the tags_metadata.yaml file used during preprocessing.
- models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- requirements.txt: Lists the required Python packages to set up the environment and run the code.
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
pip install -r requirements.txt
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this work is composed of data from four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total of 15 days. They carried the wearable during the day, and during the nights we asked participants to charge it and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires, which are used to label our data. However, some participants could not complete the full experiment, or some days were discarded due to data corruption. Specific demographic information, total sampling days, and total number of EMA answers can be found in Table I.
|  | Participant 1 | Participant 2 | Participant 3 | Participant 4 |
|---|---|---|---|---|
| Age | 67 | 55 | 60 | 63 |
| Gender | Male | Female | Male | Female |
| Final Valid Days | 9 | 15 | 12 | 13 |
| Total EMAs | 42 | 57 | 64 | 46 |
Table I. Summary of participants' collected data.
This dataset provides three different types of labels. Activeness and happiness are two of these labels. These are the answers to EMA questionnaires that participants reported during their daily activities. These labels are numbers between 0 and 4.
These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.
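One possible reading of this mapping (a hedged sketch, not necessarily the authors' exact interpolation procedure) is to treat happiness as valence and activeness as arousal and assign each answer pair to the nearest 45-degree sector of Russell's circumplex:

import math

STATES = ["pleasure", "excitement", "arousal", "distress",
          "misery", "depression", "sleepiness", "contentment"]

def circumplex_state(activeness: int, happiness: int) -> str:
    # Centre the 0-4 answers so that 2 corresponds to a neutral response.
    valence = happiness - 2   # x axis: misery ... pleasure
    arousal = activeness - 2  # y axis: sleepiness ... arousal
    angle = math.degrees(math.atan2(arousal, valence)) % 360
    # Each state occupies a 45-degree sector, with "pleasure" centred on 0 degrees.
    return STATES[int(((angle + 22.5) % 360) // 45)]

print(circumplex_state(activeness=4, happiness=4))  # excitement
print(circumplex_state(activeness=0, happiness=2))  # sleepiness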
The data we provide in this repository consist of two types of files:
NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the signal ACC for sampling day 1. The same applies to the Excel files.
Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction
References:
[1] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
The Film dataset is a large collection of audio files with full metadata, including chords, instrumentation, key, tempo, and timestamps. This dataset is designed for machine learning applications and serves as a reliable resource for generative AI music, Music Information Retrieval (MIR), and source separation. With an emphasis on supporting machine learning research, the dataset allows researchers to delve into the complexities of film music, enabling the development of algorithms capable of generating creative compositions that genuinely represent the emotive nuances of various genres.
Film music, an essential component of cinematic storytelling, plays an important role in increasing spectator engagement and emotional resonance. Composers work collaboratively with filmmakers to create music that enhances visual aspects, sets the tone, and reinforces story themes.
Training models on this cinema dataset allows researchers to better grasp and mimic these artistic details, extending the bounds of AI-generated music and contributing to advances in MIR and source separation.
Subsurface data analysis, reservoir modeling, and machine learning (ML) techniques have been applied to the Brady Hot Springs (BHS) geothermal field in Nevada, USA to further characterize the subsurface and assist with optimizing reservoir management. Hundreds of reservoir simulations have been conducted in TETRAD-G and CMG STARS to explore different injection and production fluid flow rates and allocations and to develop a training data set for ML. This process included simulating the historical injection and production since 1979 and predicting future performance through 2040. ML networks were created and trained using TensorFlow based on multilayer perceptron, long short-term memory, and convolutional neural network architectures. These networks took as input selected flow rates, injection temperatures, and historical field operation data and produced estimates of future production temperatures. This approach was first successfully tested on a simplified single-fracture doublet system, followed by the application to the BHS reservoir. Using an initial BHS data set with 37 simulated scenarios, the trained and validated network predicted the production temperature for six production wells with a mean absolute percentage error of less than 8%. In a complementary analysis effort, principal component analysis applied to 13 BHS geological parameters revealed that vertical fracture permeability shows the strongest correlation with fault density and fault intersection density. A new BHS reservoir model was developed considering the fault intersection density as a proxy for permeability. This new reservoir model helps to explore underexploited zones in the reservoir. A data gathering plan to obtain additional subsurface data was developed; it includes temperature surveying for three idle injection wells at which the reservoir simulations indicate high bottom-hole temperatures. The collected data assist with calibrating the reservoir model. Data gathering activities are planned for the first quarter of 2021. This GDR submission includes a preprint of the paper titled "Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs" presented at the 46th Stanford Geothermal Workshop (SGW) on Geothermal Reservoir Engineering from February 16-18, 2021.
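As a hedged illustration only (shapes and data below are synthetic, not the GDR data), the LSTM variant described above might be set up in TensorFlow roughly as follows:

import numpy as np
import tensorflow as tf

# Illustrative dimensions: 37 simulated scenarios, a 120-step operational history,
# 10 input channels (flow rates, injection temperatures, ...), 6 production wells.
n_scenarios, timesteps, n_inputs, n_wells = 37, 120, 10, 6
X = np.random.rand(n_scenarios, timesteps, n_inputs).astype("float32")
y = np.random.rand(n_scenarios, n_wells).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, n_inputs)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(n_wells),  # one future production temperature per well
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=5, batch_size=8, verbose=0)
print(model.predict(X[:1]).shape)  # (1, 6)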
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr
raw = zarr.open("/path/to/sample.zarr", mode='r', path="volumes/raw")
seg = zarr.open("/path/to/sample.zarr", mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt, application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results, please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year = 2024,
  eprint = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We built a crawler to collect data from the Google Play store, including each application's metadata and APK files. The manifest files were extracted from the APK files and then processed to extract the features. The dataset is composed of 870,515 records/apps, and for each app we produced 48 features. The dataset was used to build and test two bootstrap-aggregating (bagging) ensembles of multiple XGBoost machine learning classifiers. The data were collected between April 2017 and November 2018. We then checked the status of these applications on three different occasions: December 2018, February 2019, and May-June 2019.
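A hedged sketch of a bootstrap-aggregating ensemble of XGBoost classifiers of the kind described above, with synthetic data standing in for the 48 extracted features:

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=48, random_state=0)
rng = np.random.default_rng(0)

members = []
for _ in range(10):                             # 10 bootstrap replicates
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    clf = XGBClassifier(n_estimators=100, eval_metric="logloss")
    clf.fit(X[idx], y[idx])
    members.append(clf)

# Aggregate by averaging predicted probabilities across ensemble members.
proba = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
pred = (proba >= 0.5).astype(int)
print("training-set accuracy:", (pred == y).mean())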
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The dataset is designed to facilitate the development and evaluation of machine learning models for managing and distributing traditional cultural and educational resources, leveraging IoT data and deep learning techniques. The goal is to provide a rich set of features for exploring user interactions with cultural artifacts and historical sites, with the support of IoT-enabled sensors and virtual learning experiences.
Columns Overview: User_ID: A unique identifier for each user interacting with the system. This helps track individual behavior patterns and interactions.
Artifact_ID: The unique identifier for each artifact (e.g., a cultural artifact or historical item) that is part of the learning experience.
Artifact_Name: The name of the artifact. This could range from physical objects like sculptures, to digital representations of ancient items or sites.
Artifact_Type: A categorical field indicating the type of artifact, such as sculpture, musical instrument, religious artifact, etc.
Artifact_Location: The physical or virtual location of the artifact. This could be a museum, historical site, or an online platform.
Artifact_Historical_Significance: A textual description of the cultural and historical importance of the artifact. This can be used to provide more context for the artifact’s significance.
Digital_Representation_URL: A URL pointing to the digital representation of the artifact, which could be an image, video, or audio file related to the artifact.
Artifact_Tags: Tags that describe the artifact in terms of categories like "Ancient", "Religious", "Musical", etc. These can be used for classification or recommendations.
Site_ID: The unique identifier for each site (e.g., the location where a cultural artifact is located or a historical site).
Site_Name: The name of the site, such as "Great Wall of China" or "Colosseum".
Site_Location: The geographical coordinates (latitude and longitude) of the site.
Site_Cultural_Importance: A description of the cultural importance of the site.
Virtual_Tour_URL: A URL linking to a virtual tour of the site or artifact, offering an interactive learning experience.
IoT_Sensor_ID: The identifier of the IoT sensor collecting data from the artifact or site. These sensors track various environmental factors such as temperature, humidity, and motion.
IoT_Sensor_Type: The type of sensor used to collect data, such as "Temperature", "Humidity", "Motion", or "Light".
IoT_Sensor_Location: The location of the IoT sensor in relation to the artifact or site.
Sensor_Timestamp: The date and time when the sensor data was recorded.
Sensor_Reading: The actual value recorded by the IoT sensor. This could represent temperature, humidity, motion detection, or other environmental factors.
User_Interaction_Type: The type of interaction the user has with the artifact or site. This can include actions like "View", "Rate", or "Click".
User_Interaction_Timestamp: The timestamp of when the user interaction took place. This allows us to track the time of day and the frequency of interactions.
Interaction_Duration: The duration of the user’s interaction with the artifact or site, measured in minutes. This helps in analyzing user engagement levels.
User_Feedback: The feedback provided by the user after interacting with the artifact or site, such as "Liked", "Not Interested", or "Neutral".
Target_Column: The target variable for machine learning models. This column indicates whether the user had a positive interaction (1) or a negative interaction (0). A positive interaction is defined as a "view" action, while negative interactions include other actions like "rate" or "click".
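A minimal pandas sketch of deriving Target_Column from User_Interaction_Type exactly as defined above:

import pandas as pd

df = pd.DataFrame({"User_Interaction_Type": ["View", "Rate", "Click", "View"]})
df["Target_Column"] = (df["User_Interaction_Type"].str.lower() == "view").astype(int)
print(df)
#   User_Interaction_Type  Target_Column
# 0                  View              1
# 1                  Rate              0
# 2                 Click              0
# 3                  View              1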
Monitoring asthma condition is essential to asthma self-management. However, traditional methods of monitoring require high levels of active engagement, and patients may regard this level of monitoring as tedious. Passive monitoring with mobile health devices, especially when combined with machine learning, provides an avenue to dramatically reduce management burden. However, data for developing machine learning algorithms are scarce, and gathering new data is expensive. A few asthma mHealth datasets are publicly available, but they lack objective and passively collected data which may enhance asthma attack prediction systems. To fill this gap, we carried out the 2-phase, 7-month AAMOS-00 observational study to collect data about asthma status using three smart monitoring devices (smart peak flow meter, smart inhaler, smartwatch) and daily symptom questionnaires. Combined with localised weather, pollen, and air quality reports, we have collected a rich longitudinal dataset to explore the feasibility of passive monitoring and asthma attack prediction. Conducting phase 2 of device monitoring over 12 months, from June 2021 to June 2022 and during the COVID-19 pandemic, 22 participants across the UK provided 2,054 unique patient-days of data. This valuable anonymised dataset has been made publicly available with the consent of participants. Ethics approval was provided by the East of England - Cambridge Central Research Ethics Committee (IRAS project ID: 285505), with governance approval from ACCORD (Academic and Clinical Central Office for Research and Development), project number AC20145. The study sponsor was ACCORD, the University of Edinburgh. The anonymised dataset was produced with statistical advice from Aryelly Rodriguez, Statistician, Edinburgh Clinical Trials Unit, University of Edinburgh.
Protocol: 'Predicting asthma attacks using connected mobile devices and machine learning; the AAMOS-00 observational study protocol', BMJ Open, DOI: 10.1136/bmjopen-2022-064166.
Thesis: Tsang, Kevin CH, 'Application of data-driven technologies for asthma self-management' (2022) [Doctoral Thesis], University of Edinburgh, https://era.ed.ac.uk/handle/1842/40547.
The dataset also relates to the publication: K.C.H. Tsang, H. Pinnock, A.M. Wilson, D. Salvi and S.A. Shah (2023), 'Home monitoring with connected mobile devices for asthma attack prediction with machine learning', Scientific Data 10 (https://doi.org/10.1038/s41597-023-02241-9).
This paper presents Hit2Flux, a machine learning framework for boiling heat flux prediction using acoustic emission (AE) hits generated through threshold-based transient sampling. Unlike continuously sampled data, AE hits are recorded when the signal exceeds a predefined threshold and are thus discontinuous in nature. Meanwhile, each hit represents a waveform at a high sampling frequency (~1 MHz). In order to capture the features of both the high-frequency waveforms and the temporal distribution of hits, Hit2Flux involves i) feature extraction by transforming AE hits into the frequency domain and organizing these spectra into sequences using a rolling window to form "sequences-of-sequences," and ii) heat flux prediction using a long short-term memory (LSTM) network with sequences of sequences. The model is trained on AE hits recorded during pool boiling experiments using an AE sensor attached to the boiling chamber. Continuously sampled acoustic data using a hydrophone were also collect...
Data for: Hit2flux: A machine learning framework for boiling heat flux prediction using hit-based acoustic emission sensing
Dataset DOI: 10.5061/dryad.g79cnp628
This dataset includes acoustic emission hit data and waveforms, hydrophone data, pressure, and temperature data from transient pool boiling tests on copper microchannels. The pool boiling test facility includes (a) a heating element that consists of a copper block and cartridge heaters; (b) a closed chamber with flow loops for a chiller (Thermo Scientific Polar ACCEL 500 Low/EA) connecting a Graham condenser (Ace glass 5953-106) with an adapter (Ace glass 5838-76) and an in-house built coiled copper condenser; and (c) a synchronized multimodal sensing system. The copper block is submerged in deionized water and heated by nine cartridge heaters (Omega Engineering HDC19102), each with a power rating of 50 W, inserted from the bottom. The cartridge heaters ...
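A hedged sketch of the "sequences of sequences" construction described in the abstract (synthetic waveforms stand in for the recorded AE hits; waveform length and window size are arbitrary choices):

import numpy as np

hits = [np.random.randn(1024) for _ in range(50)]  # 50 illustrative AE hit waveforms

# Frequency-domain representation of each hit (one-sided magnitude spectrum).
spectra = np.stack([np.abs(np.fft.rfft(h)) for h in hits])  # shape (50, 513)

# Rolling window over consecutive hits: each sample is a sequence of spectra,
# i.e. one element of the "sequences of sequences" fed to the LSTM.
window = 10
sequences = np.stack([spectra[i:i + window] for i in range(len(spectra) - window + 1)])
print(sequences.shape)  # (41, 10, 513) -> (samples, hits per window, frequency bins)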
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comprises novel aspects, specifically in terms of student grading in diverse educational cultures across multiple countries; researchers and other education sectors will be able to see the impact of having varied curriculums in a country. The dataset compares different levelling cases when students transfer from curriculum to curriculum, and the unreliable levelling criteria currently set by schools in an international school. The collected data can be used with intelligent algorithms, specifically machine learning and pattern analysis methods, to develop an intelligent framework applicable in multi-cultural educational systems to aid in a smooth transition ("levelling", hereafter) of students who relocate from a particular education curriculum to another, and to minimize the impact of switching on the students' educational performance. The preliminary variables taken into consideration when deciding which data to collect depended on the variables. The UAE is a multicultural country with many expats relocating from regions such as Asia, Europe, and America. In order to meet expats' needs, the UAE has established many international private schools; therefore, the UAE was chosen as the location of study, based on many cases and struggles in levelling declared by the Ministry of Education and schools. For the first time, we present this dataset comprising students' records for two academic years, including math, English, and science for three terms. Selection of subject areas and number of terms was based on influence from other researchers in similar subject matters.
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, with:
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.
The data subset used in this work comprises the following:
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, covering each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation 'SM0' (sample month 0), 'SM1', 'SM2' and 'SM3' to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
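A hedged sketch of how such non-overlapping 3-month samples could be enumerated from the Parquet file's [participant]_[month] index (the file name below is a placeholder):

import pandas as pd

df = pd.read_parquet("discover_features.parquet")  # placeholder file name

# Recover participant and month from the "[participant]_[month]" index.
parts = df.index.to_series().str.split("_", expand=True)
df["participant"] = parts[0].values
df["month"] = parts[1].astype(int).values

samples = []
for pid, grp in df.groupby("participant"):
    months = sorted(grp["month"])
    # Step by 3 months so that consecutive samples (SM0 -> SM3) do not overlap.
    for sm0 in months[::3]:
        if sm0 + 3 in months:
            samples.append((pid, sm0, sm0 + 3))

print(f"{len(samples)} candidate 3-month samples")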