MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a python3 version:
```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
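As a minimal sketch of how the data and labels entries fit together (assuming NumPy and the Python 3 unpickle above, with the batch files in the working directory; dictionary keys are bytes because of encoding='bytes'):

```python
import numpy as np

batch = unpickle("data_batch_1")
data = batch[b'data']            # (10000, 3072) uint8: 1024 R + 1024 G + 1024 B per row
labels = batch[b'labels']        # list of 10000 ints in 0-9

# Reshape each 3072-entry row into a 32x32x3 (height, width, channel) image.
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
print(images.shape, labels[0])   # (10000, 32, 32, 3) and the first image's label
```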
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
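A minimal reading sketch for the binary version, assuming NumPy and that the .bin files and batches.meta.txt sit in the working directory:

```python
import numpy as np

# Each record is 1 label byte followed by 3072 pixel bytes (R, then G, then B planes).
raw = np.fromfile("data_batch_1.bin", dtype=np.uint8).reshape(-1, 3073)
labels = raw[:, 0].astype(np.int64)                                # values 0-9
images = raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)   # (10000, 32, 32, 3) uint8

with open("batches.meta.txt") as f:
    label_names = [line.strip() for line in f if line.strip()]
print(images.shape, label_names[labels[0]])
```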
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
Neural Net Model Parameter files for my kernel: https://www.kaggle.com/whatsthevariance/diagnosing-skin-cancer-with-bagged-neural-nets
350 epochs' worth of training can be loaded from these files using PyTorch.
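A minimal loading sketch, assuming PyTorch; the file name and model class below are hypothetical stand-ins for the parameter files and architecture defined in the linked kernel:

```python
import torch

# Hypothetical file name; use the actual parameter files attached to this dataset.
state_dict = torch.load("bagged_net_0.pt", map_location="cpu")

# model = BaggedNet()                # the architecture defined in the linked kernel
# model.load_state_dict(state_dict)
# model.eval()
```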
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding author: Peng Hou (houpcy@163.com)
Abstract: The Tibetan Plateau (TP), one of the most climate-sensitive regions on Earth, plays a crucial role in global carbon cycling. However, the spatiotemporal variability of modeled above- and below-ground net primary production (ANPP and BNPP) remains uncertain across linear (LL), machine learning (ML), and deep learning (DL) models, particularly for BNPP. To address this gap, we applied 96 data-driven models, including LL, ML, and DL approaches, combined with 5-fold cross-validation and Monte Carlo simulations to estimate ANPP and BNPP at 1 km resolution from 1981 to 2018 across the TP. The results showed that the best-performing models achieved R2 values ranging from 0.80 to 0.88 for ANPP and from 0.89 to 0.95 for BNPP. Spatiotemporal patterns of ANPP and BNPP were generally consistent across model types. However, total ANPP exhibited a significant declining trend at −0.003 Pg C yr−1, while BNPP increased by 0.001 to 0.003 Pg C yr−1. Notably, inter-model variability in annual totals reached up to 0.13 and 0.32 Pg C yr−1 for ANPP and BNPP, respectively. These discrepancies likely stem from differences in how models interpret input variable contributions, as reflected in distinct spatial patterns, particularly in DL simulations, which showed divergence in ANPP across the southern TP (e.g., Nyingchi) and BNPP in northern to central regions (e.g., from Xining to Zhiduo). Our findings offer a robust methodological benchmark for modeling ecosystem carbon allocation under climate change and provide valuable insights for adaptive carbon management in one of the world’s most vulnerable regions.
Filename: ANPP_xgblinear_Tibetan_1981-2018_1km_tif.zip; ANPP_Rborist_Tibetan_1981-2018_1km_tif.zip; ANPP_HYFIS_Tibetan_1981-2018_1km_tif.zip; BNPP_xgblinear_Tibetan_1981-2018_1km_tif.zip; BNPP_xgbDART_Tibetan_1981-2018_1km_tif.zip; BNPP_monmlp_Tibetan_1981-2018_1km_tif.zip.
File information: The name of each Zip compressed file is composed of the observation object, simulation model, region, time range, spatial resolution, and data format. For example, 'ANPP_xgblinear_Tibetan_1981-2018_1km_tif.zip' consists of ANPP (observation object) + '_' + xgblinear (simulation model) + '_' + Tibetan (region) + '_' + 1981-2018 (time range) + '_' + 1km (spatial resolution) + '_' + tif (data format) + '.zip'. Among them, ANPP is Aboveground Net Primary Production and BNPP is Belowground Net Primary Production; xgbLinear is a linear model, Rborist/xgbTree are machine learning models, and HYFIS/monmlp are deep learning models. The unit for these data is 'g C m-2 yr-1'.
Author contributions: Tao Zhou contributed to the conceptualization, methodology, software, and writing - original draft, review, editing; Benjamin Laffitte, Jianfei Cao, Xuwei Sun and Guangjin Zhou supervised manuscript writing; Yuting Hou contributed to the data curation and software; Peng Hou contributed to data curation, writing - original draft preparation, software, and writing - review, editing; all authors contributed to the final preparation of the manuscript.
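As a sketch of the file-naming convention described above (standard-library Python; assumes the canonical form of the name without stray spaces):

```python
name = "ANPP_xgblinear_Tibetan_1981-2018_1km_tif.zip"
target, model, region, period, resolution, fmt = name.rsplit(".", 1)[0].split("_")
print(target, model, region, period, resolution, fmt)
# ANPP xgblinear Tibetan 1981-2018 1km tif
```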
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are extended versions of the MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18] datasets, with features from the Google Cloud Vision API. These datasets are stored in jsonl (JSON Lines) format.
Abstract (from our paper):
There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM2S2). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to enhance the importance of elements with modality-level granularity further. Our concept exhibits performance that is comparable to or better than the previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.
Dataset (MM-IMDB and Ads-Parallelity):
We extended two multimodal datasets, namely MM-IMDB [Arevalo+ ICLRW'17] and Ads-Parallelity [Zhang+ BMVC'18], for the empirical experiments. The MM-IMDB dataset contains 25,925 movies with multiple labels (genres). We used the original split provided in the dataset and reported the F1 scores (micro, macro, and samples) of the test set. The Ads-Parallelity dataset contains 670 images and slogans from persuasive advertisements to understand the implicit relationship (parallel and non-parallel) between these two modalities. A binary classification task is used to predict whether the text and image in the same ad convey the same message.
We transformed the following multimodal information (i.e., visual, textual, and categorical data) into textual tokens and fed these into our proposed model. We used the Google Cloud Vision API for the visual features to obtain the following four pieces of information as tokens: (1) text from the OCR, (2) category labels from the label detection, (3) object tags from the object detection, and (4) the number of faces from the facial detection. We input the labels and object detection results as a sequence in order of confidence, as obtained from the API. We describe the visual, textual, and categorical features of each dataset below.
MM-IMDB: We used the title and plot of movies as the textual features, and the aforementioned API results based on poster images as visual features.
Ads-Parallelity: We used the same API-based visual features as in MM-IMDB. Furthermore, we used textual and categorical features consisting of textual inputs of transcriptions and messages, and categorical inputs of natural and text concrete images.
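A minimal sketch of the token construction described above, assuming the Vision API outputs were exported to the jsonl records as lists of description/score pairs (all field names here are illustrative, not the dataset's actual schema):

```python
# Hypothetical record layout for one example.
record = {
    "ocr_text": "SUMMER SALE 50% OFF",
    "labels": [{"description": "poster", "score": 0.93}, {"description": "font", "score": 0.88}],
    "objects": [{"name": "person", "score": 0.81}],
    "num_faces": 1,
}

def visual_tokens(rec):
    # Keep labels and objects in order of confidence, as described above.
    labels = [x["description"] for x in sorted(rec["labels"], key=lambda x: -x["score"])]
    objects = [x["name"] for x in sorted(rec["objects"], key=lambda x: -x["score"])]
    return rec["ocr_text"].split() + labels + objects + [f"faces_{rec['num_faces']}"]

print(visual_tokens(record))
```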
This is an example data source which can be used for Predictive Maintenance model building. It consists of the following data:
Telemetry Time Series Data (PdM_telemetry.csv): It consists of hourly averages of voltage, rotation, pressure, and vibration collected from 100 machines for the year 2015.
Errors (PdM_errors.csv): These are errors encountered by the machines while in operating condition. Since these errors don't shut down the machines, they are not considered failures. The error dates and times are rounded to the closest hour since the telemetry data is collected at an hourly rate.
Maintenance (PdM_maint.csv): If a component of a machine is replaced, that is captured as a record in this table. Components are replaced under two situations: 1. During a regular scheduled visit, the technician replaces it (Proactive Maintenance); 2. A component breaks down and the technician does an unscheduled maintenance to replace it (Reactive Maintenance). The latter is considered a failure, and the corresponding data is captured under Failures. Maintenance data has both 2014 and 2015 records. This data is rounded to the closest hour since the telemetry data is collected at an hourly rate.
Failures (PdM_failures.csv): Each record represents replacement of a component due to failure. This data is a subset of Maintenance data. This data is rounded to the closest hour since the telemetry data is collected at an hourly rate.
Metadata of Machines (PdM_Machines.csv): Model type & age of the Machines.
This dataset was available as a part of Azure AI Notebooks for Predictive Maintenance. But as of 15th Oct, 2020 the notebook (link) is no longer available. However, the data can still be downloaded using the following URLs:
https://azuremlsampleexperiments.blob.core.windows.net/datasets/PdM_telemetry.csv
https://azuremlsampleexperiments.blob.core.windows.net/datasets/PdM_errors.csv
https://azuremlsampleexperiments.blob.core.windows.net/datasets/PdM_maint.csv
https://azuremlsampleexperiments.blob.core.windows.net/datasets/PdM_failures.csv
https://azuremlsampleexperiments.blob.core.windows.net/datasets/PdM_machines.csv
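A minimal loading sketch, assuming pandas and that the blob URLs above are still reachable (the datetime column name is an assumption about the CSV headers):

```python
import pandas as pd

base = "https://azuremlsampleexperiments.blob.core.windows.net/datasets/"
telemetry = pd.read_csv(base + "PdM_telemetry.csv", parse_dates=["datetime"])  # hourly sensor averages
errors = pd.read_csv(base + "PdM_errors.csv", parse_dates=["datetime"])
maint = pd.read_csv(base + "PdM_maint.csv", parse_dates=["datetime"])
failures = pd.read_csv(base + "PdM_failures.csv", parse_dates=["datetime"])
machines = pd.read_csv(base + "PdM_machines.csv")

print(telemetry.head())
```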
Try to use this data to build Machine Learning models related to Predictive Maintenance.
The following are specific objectives for the spring CalCOFI.
- Continuously sample pelagic fish eggs using the Continuous Underway Fish Egg Sampler (CUFES). The data will be used to estimate the distributions and abundances of spawning hake, anchovy, mackerel, and early spawning Pacific sardine.
- Continuously sample multi-frequency acoustic backscatter using the Simrad EK80 and the Simrad ME70. The data will be used to estimate the distributions and abundances of coastal pelagic fishes (e.g., sardine, anchovy, and mackerel) and krill species.
- Continuously sample sea-surface temperature, salinity, and chlorophyll-a using a thermosalinometer and fluorometer. These data will be used to estimate the physical oceanographic habitats for target species.
- Continuously sample air temperature, barometric pressure, and wind speed and direction using an integrated weather station.
- Sample profiles of seawater temperature, salinity, chlorophyll-a, nutrients, and phytoplankton using a CTD with water-sampling rosette and other instruments at prescribed stations. Measurements of extracted chlorophyll and phaeophytin will be obtained with a fluorometer. Nutrients will be measured with an auto-analyzer.
- Sample the light intensity in the photic zone using a standard Secchi disk in conjunction with a daytime CTD station.
- Sample plankton using a CalBOBL (CalCOFI Bongo Oblique) at prescribed stations. These data will be used to estimate the distributions and abundances of ichthyoplankton and zooplankton species.
- Sample plankton using a Manta (neuston) net at prescribed stations. These data will be used to estimate the distributions and abundances of ichthyoplankton species.
- Sample the vertically integrated abundance of fish eggs using a Pairovet net at prescribed stations. These data will be used to quantify the abundances and distributions of fish eggs.
- Sample plankton using a PRPOOS (Planktonic Rate Processes in Oligotrophic Ocean Systems) net at all prescribed CalCOFI stations on lines 90.0 and 80.0 as well as stations out to and including station 70.0 on lines 86.7 and 83.3 and station 81.8 46.9. PRPOOS will not be towed on SCCOOS stations. These data will be used in analyses by the LTER (Long Term Ecological Research) project.
- Continuously sample profiles of currents using the RDI/Teledyne Acoustic Doppler Current Profiler. This will be dependent on the ability to synchronize the ADCP’s output with the EK80 and ME70. The EK80 and ME70 will hold priority over the ADCP.
- Continuously observe, during daylight hours, seabirds and marine mammals. These data will be used to estimate the distributions and abundances of seabirds and marine mammals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From the parent record held in the GCMD:
The data sets in the CDC archive called "Reynolds SST" and "Reconstructed Reynolds SST" were discontinued on 1 April 2003.
A new OI SST data set is available as described here, which includes a new analysis for the historical data and updates into the future. NCEP will not provide new data for the "Reynolds SST" after December 2002 and CDC will remove the "Reynolds SST" data set on 1 April 2003.
TO SEE THE NEW DATASET, PLEASE SEARCH THE GLOBAL CHANGE MASTER DIRECTORY FOR MORE INFORMATION. REFER TO THE METADATA RECORD (LINKED BELOW): REYNOLDS_SST
This metadata record is a modified child record of an original parent record registered at the Global Change Master Directory. (The Entry ID of the parent record is REYNOLDS_SST, and can be found on the GCMD website - see the provided URL). The data described here are a subset of the original dataset. This metadata record has been created for the express use of Australian Government Antarctic Division employees.
Reproduced from: http://www.emc.ncep.noaa.gov/research/cmb/sst_analysis/
Analysis Description and Recent Reanalysis
The optimum interpolation (OI) sea surface temperature (SST) analysis is produced weekly on a one-degree grid. The analysis uses in situ and satellite SSTs plus SSTs simulated by sea ice cover. Before the analysis is computed, the satellite data are adjusted for biases using the method of Reynolds (1988) and Reynolds and Marsico (1993). A description of the OI analysis can be found in Reynolds and Smith (1994). The bias correction improves the large scale accuracy of the OI.
In November 2001, the OI fields were recomputed for late 1981 onward. The new version will be referred to as OI.v2.
The most significant change for the OI.v2 is the improved simulation of SST obs from sea ice data following a technique developed at the UK Met Office. This change has reduced biases in the OI SST at higher latitudes. Also, the update and extension of COADS has provided us with improved ship data coverage through 1997, reducing the residual satellite biases in otherwise data sparse regions.
The data are available in the following formats: NetCDF, flat binary files, and text.
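A minimal reading sketch for the NetCDF version, assuming xarray; the file and variable names are hypothetical, since the archive's layout is not described here:

```python
import xarray as xr

ds = xr.open_dataset("oiv2_weekly_sst.nc")   # hypothetical file name
print(ds)                                    # inspect the weekly one-degree grid and variables
sst = ds["sst"]                              # assumed variable name for the SST field
```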
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we provide CO2-system properties that were continuously measured along a southeast-northwest transect in the South Atlantic Ocean in which six Agulhas eddies were sampled. The Following Ocean Rings in the South Atlantic (FORSA) cruise occurred between 27th June and 15th July 2015, from Cape Town (South Africa) to Arraial do Cabo (Brazil), on board the first research cruise of the Brazilian Navy RV Vital de Oliveira, as part of an effort of the Brazilian High Latitude Oceanography Group (GOAL). It also contributed to the activities developed by the following Brazilian networks: GOAL, the Brazilian Ocean Acidification Network (BrOA), and the Brazilian Research Network on Global Climate Change (Rede CLIMA). The focus of the first study using this dataset (Orselli et al. 2019a) was to investigate the role played by the Agulhas eddies on the sea-air CO2 net flux along their trajectories through the South Atlantic Ocean and to model the seawater CO2-related properties as a function of environmental parameters. These data have been used to contribute to the scientific discussion about the impact of Agulhas eddies on changes in the marine carbonate system, which is an expanding oceanographic subject (Carvalho et al. 2019; Orselli et al. 2019b; Ford et al. 2023).
Seawater and atmospheric CO2 molar fractions (xCO2sw and xCO2atm, respectively) were continuously measured along the cruise track, as well as the sea surface temperature (T) and salinity (S). The sampling methodology is fully described in Orselli et al. (2019a). The underway xCO2 sampling was taken using an autonomous system GO-8050, General Oceanic®, equipped with a non-dispersive infrared gas analyzer (LI-7000, LI-COR®). The underway T and S were sampled using a Sea-Bird® Thermosalinograph SBE21. The seawater intake feeding the continuous GO-8050 and SBE21 systems was set at ~5 m below the sea surface. The xCO2 system was calibrated with four standard gases (CO2 concentrations of 0, 202.10, 403.20, and 595.50 uatm) within a 12 h interval along the entire cruise. Every 3 h the system underwent a standard reading, to check the derivation and allow the xCO2 corrections. The xCO2 measurements were taken at 90-second intervals. After a hundred xCO2sw readings, the system was switched to the atmosphere and five xCO2atm readings were taken (Pierrot et al., 2009). xCO2 (umol mol-1) inputs were corrected using the CO2 standards (Pierrot et al., 2009). Thermosalinograph data were corrected using the CTD surface data.
Then, together with the pressure data, these data were used to calculate the pCO2 of the equilibrator and atmosphere (pCO2eq and pCO2atm, respectively, uatm), following Weiss & Price (1980). Using the pCO2eq, which is calculated at the equilibrator temperature, it is possible to calculate the pCO2 at the in situ temperature (pCO2sw, uatm), according to Takahashi et al. (2009). Another common calculation regarding pCO2sw data is the temperature-normalized pCO2sw (NpCO2sw, uatm). This means that the temperature effect is removed when one calculates the NpCO2sw for the mean cruise temperature. The procedure followed Takahashi et al. (2009) and considered the mean cruise temperature of 20.39°C. The results obtained allow one to investigate the exchanges of CO2 at the ocean-atmosphere interface by calculating the pCO2 difference between these two reservoirs (DeltapCO2, DpCO2=pCO2sw-pCO2atm, uatm). Negative (positive) DpCO2 results indicate that the ocean acts as a CO2 sink (source) for the atmosphere.
To determine the FCO2, the monthly mean wind speed data of July 2015 (at 10 m height) were extracted from the ERA-Interim atmospheric reanalysis product of the European Centre for Medium-Range Weather Forecasts (http://apps.ecmwf.int/datasets/data/interim-full-moda/levtype=sfc/), since the use of long-term means is usual (e.g., Takahashi et al., 2009). The average wind speed for the period and whole area was 6.8 ± 0.6 m s−1, ranging from 5.6 to 8.3 m s−1. The CO2 transfer coefficients proposed by Takahashi et al. (2009) and Wanninkhof (2014) were used. With all these data together, the FCO2 was determined according to Broecker & Peng (1982), where FCO2 is the sea-air CO2 net flux (mmol m−2 d−1); FT09 and FW14 are the sea-air CO2 fluxes calculated using the coefficients described in Takahashi et al. (2009) and Wanninkhof (2014), respectively.
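A minimal sketch of the DpCO2 and temperature-normalization steps described above (illustrative numbers only; the 0.0423 per-degree factor is the standard Takahashi-type empirical coefficient and is used here purely for illustration):

```python
import math

# Illustrative values, not measurements from the cruise.
pco2_sw, pco2_atm = 385.0, 398.0      # uatm, at in situ temperature
t_insitu, t_mean = 23.1, 20.39        # degrees C; 20.39 is the mean cruise temperature

dpco2 = pco2_sw - pco2_atm            # negative => the ocean acts as a CO2 sink

# Temperature-normalized pCO2sw at the mean cruise temperature.
npco2_sw = pco2_sw * math.exp(0.0423 * (t_mean - t_insitu))

print(f"DpCO2 = {dpco2:.1f} uatm, NpCO2sw = {npco2_sw:.1f} uatm")
```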
Survey the distributions and abundances of pelagic fish stocks, their prey, and their biotic and abiotic environments in the area of the California Current between San Francisco, California and San Diego, California. The following are specific objectives for the spring CalCOFI.
I.D.1. Continuously sample pelagic fish eggs using the Continuous Underway Fish Egg Sampler (CUFES). The data will be used to estimate the distributions and abundances of spawning hake, anchovy, mackerel, and Pacific sardine.
I.D.2. Continuously sample sea-surface temperature, salinity, and chlorophyll-a using a thermosalinometer and fluorometer. These data will be used to estimate the physical oceanographic habitats for target species.
I.D.3. Continuously sample air temperature, barometric pressure, and wind speed and direction using an integrated weather station.
I.D.4. Sample profiles of seawater temperature, salinity, chlorophyll-a, nutrients, and phytoplankton using a CTD with water-sampling rosette and other instruments at prescribed stations. Measurements of extracted chlorophyll and phaeophytin will be obtained with a fluorometer. Primary production will be measured as C14 uptake in a six hour in situ incubation. Nutrients will be measured with an auto-analyzer. These data will be used to estimate primary productivity and the biotic and abiotic habitats for target species.
I.D.5. Sample the light intensity in the photic zone using a standard Secchi disk once per day in conjunction with a daytime CTD station. These data will be used to interpret the measurements of primary production.
I.D.6. Sample plankton using a CalBOBL (CalCOFI Bongo Oblique) at prescribed stations. These data will be used to estimate the distributions and abundances of ichthyoplankton and zooplankton species.
I.D.7. Sample plankton using a Manta (neuston) net at prescribed stations. These data will be used to estimate the distributions and abundances of ichthyoplankton species.
I.D.8. Sample the vertically integrated abundance of fish eggs using a Pairovet net at prescribed stations. These data will be used to quantify the abundances and distributions of fish eggs.
I.D.9. Sample plankton using a PRPOOS (Planktonic Rate Processes in Oligotrophic Ocean Systems net) at all prescribed CalCOFI stations on lines 90.0 and 80.0 as well as stations out to and including station 70.0 on lines 86.7 and 83.3 and station 81.8 46.9. PRPOOS will not be towed on SCCOOS stations. These data will be used in analyses by the LTER (Long Term Ecological Research) project.
I.D.10. Continuously sample profiles of currents using the RDI/Teledyne Acoustic Doppler Current Profiler.
I.D.11. Continuously observe, during daylight hours, seabirds and mammals. These data will be used to estimate the distributions and abundances of seabirds and marine mammals.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data sources for Badger, an open-source budget execution and data analysis tool for federal budget analysts at the Environmental Protection Agency. Badger is based on WPF and .NET 6 and is written in C#.
Databases play a critical role in environmental data analysis by providing a structured system to store, organize, and efficiently retrieve large amounts of data, allowing analysts to easily access and manipulate information needed to extract meaningful insights through queries and analysis tools; essentially acting as the central repository for data used in data analysis processes. Badger provides the following providers to store and analyze data locally.
bin - Binaries are included in the bin folder due to the complex Baby setup required. Don't empty this folder.
bin/storage - HTML and JS required for the downloads manager and custom error pages
_Environmental...
Representative dairy farms in major dairy regions of the United States were modeled using the Integrated Farm System Model to quantify potential reductions in greenhouse gas emissions using various mitigation strategies. Important data and information describing these 14 farms are documented in this table. These data include the farm location, number of cows and heifers maintained, milk produced, feeds and nutrient contents fed, crop areas, crop yields, fertilizer and lime application rates, irrigation water applied, milking and housing facilities, manure collection, storage and application methods used, and soil characteristics. Simulated output information for feed consumption, nutrient losses, fossil energy use, water use, and greenhouse gas emissions are listed for each farm. These data are published as supplementary information for the article “Strategies for mitigating greenhouse gas emissions from US dairy farms toward a net zero goal” published in the Journal of Dairy Science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.
####
Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps= program for pumps
10) WQM=check aquation manual
####
Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets
Note: Two data sets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - Check
Time: UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers)
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photo active radiation umols
F0=minimal fluorescence from PAM
Fm=Maximum fluorescence from PAM
Yield=(F0 – Fm)/Fm
rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
Temp=Temperature degrees C
PAR=Photo active radiation
PAR2= Photo active radiation2
DO=Dissolved oxygen
%Sat= Saturation of dissolved oxygen
Notes=This is the program of the underwater submersible logger with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=Gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.
8) Rapid light curve data
Pre LC: A yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor
PAM PAR: This is copied from the PAR or PAR2 column
PAR all: This is the complete PAR file and should be used
Deployment: Identifying which deployment the data came from
####
Respiration chamber biomass data
The data are chlorophyll a biomass from cores taken from the respiration chambers. The headers are:
Depth (mm)
Treat (Acidified or control)
Chl a (pigment and indicator of biomass)
Core (5 cores were collected from each chamber; three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
####
Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
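An illustrative sketch of that block-wise regression (in Python rather than the associated R script; the file name and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("respiration_chamber.csv", parse_dates=["DATETIME"])  # hypothetical file
df["elapsed_min"] = (df["DATETIME"] - df["DATETIME"].iloc[0]).dt.total_seconds() / 60

rates = []
for _, block in df.groupby(df["elapsed_min"] // 180):        # discrete 180-minute blocks
    slope, intercept = np.polyfit(block["elapsed_min"], block["DO"], 1)
    r2 = np.corrcoef(block["elapsed_min"], block["DO"])[0, 1] ** 2
    rates.append({"DO_slope_per_min": slope, "r_squared": r2})

print(pd.DataFrame(rates))
```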
####
Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generat...
This data card was written by ChatGPT. Some parts of the data card are not ready yet.
Although the dataset is publicly available, it is inconsistently maintained — file structures and column definitions vary between months. For analysis, all files have been downloaded and extracted into a single folder named “unzipped”, preserving their raw form.
Each CSV file typically contains the following columns (column names and availability may differ by month):
- rent_time: [description placeholder]
- rent_station: [description placeholder]
- return_time: [description placeholder]
- return_station: [description placeholder]
- rent: [description placeholder]
- type: [description placeholder — introduced around October 2024 to indicate bicycle type]
- infodate: [description placeholder]
YouBike 2.0 Taipei City Real-Time Information
- Source: https://data.nat.gov.tw/dataset/147580
- Provider: 臺北市政府交通局 (Taipei City Department of Transportation)
- Data URL (updated every minute): https://tcgbusfs.blob.core.windows.net/dotapp/youbike/v2/youbike_immediate.json
- Language: Chinese (Traditional)
- Format: JSON
This dataset provides real-time information on the YouBike 2.0 public bicycle system in Taipei City. The data include the location, capacity, and availability of bicycles and parking spaces at each station. Updates occur approximately every minute (but not in this dataset), and the dataset is published by the Taipei City Department of Transportation through the National Open Data Platform.
The dataset used in this project was collected for the purpose of obtaining station coordinates (latitude and longitude) and other static station information. These geographic data can be used to visualize the distribution of YouBike stations across Taipei City.
If users require up-to-date or continuously refreshed information, it is recommended to directly access the real-time JSON feed via the provided URL above.
Each record represents a single YouBike 2.0 station and includes the following fields:
- sno (站點代號 / Station ID): [description placeholder]
- sna (場站中文名稱 / Station Name - Chinese): [description placeholder]
- quantity (場站總停車格 / Total Parking Spaces): [description placeholder]
- available_rent_bikes (場站目前車輛數量 / Available Bicycles for Rent): [description placeholder]
- sarea (場站區域 / Administrative Area - Chinese): [description placeholder]
- mday (資料更新時間 / Record Update Time): [description placeholder]
- latitude (緯度 / Latitude): [description placeholder]
- longitude (經度 / Longitude): [description placeholder]
- ar (地點 / Address - Chinese): [description placeholder]
- sareaen (場站區域英文 / Administrative Area - English): [description placeholder]
- snaen (場站名稱英文 / Station Name - English): [description placeholder]
- aren (地址英文 / Address - English): [description placeholder]
- available_return_bikes (空位數量 / Available Parking Spaces): [description placeholder]
- act (全站禁用狀態 / Station Active Status): [description placeholder]
- srcUpdateTime (YouBike2.0系統發布資料更新的時間 / Source Update Time from YouBike System): [description placeholder]
- updateTime (大數據平台經過處理後將資料存入DB的時間 / Time When Data Were Processed and Stored in Database): [description placeholder]
- infoTime (各場站來源資料更新時間 / Station Data Update Time): [description placeholder]
- infoDate (各場站來源資料更新日期 / Station Data Update Date): [description placeholder]
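A minimal fetching sketch, assuming the requests library and that the feed returns a JSON array of station records with the fields listed above:

```python
import requests

URL = "https://tcgbusfs.blob.core.windows.net/dotapp/youbike/v2/youbike_immediate.json"
stations = requests.get(URL, timeout=10).json()

# Keep only the static fields needed to map station locations.
coords = [
    {"sno": s["sno"], "name": s.get("snaen", s["sna"]),
     "lat": float(s["latitude"]), "lon": float(s["longitude"])}
    for s in stations
]
print(len(coords), coords[0])
```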
(Chinese) - 臺北市政府交通局;2025;臺北市公共自行車2.0租借紀錄 - 此開放資料依政府資料開放授權條款 (Open Government Data License) 進行公眾釋出,使用者於遵守本條款各項規定之前提下,得利用之。 - 政府資料開放授權條款:https://data.gov.tw/license
(English) - Department of Transportation, Taipei City Government (DOT); 2025; 臺北市公共自行車2.0租借紀錄 - The Open Data is made available to the public under the Open Government Data License, User can make use of it when complying to the condition and obligation of its terms. - Open Government Data License:https://data.gov.tw/license
(Chinese) - 臺北市政府交通局;2025;YouBike2.0臺北市公共自行車即時資訊 - 此開放資料依政府資料開放授權條款 (Open Government Data License) 進行公眾釋出,使用者於遵守本條款各項規定之前提下,得利用之。 - 政府資料開放授權條款:https://data.gov.tw/license
(English) - Department of Transportation, Taipei City Government...
Summer 2013 CCE. This objective was accomplished using the following equipment and protocols:
• The primary goal of the survey is to estimate the biomasses, distributions, and biological compositions of Pacific hake and Pacific sardine populations using data from an integrated acoustic and trawl survey off the west coast of the U.S. and Canada from approximately San Diego, California (lat 32°48.0174’N) to the north end of Vancouver Island, Canada (lat 50°45.65’N).
• Continuously sample multi-frequency acoustic backscatter data using the ship’s Simrad EK60 scientific echo sounder system. These data will be used to estimate the distributions and abundances of hake and sardine.
• Conduct daytime trawling to classify observed backscatter layers to species and size composition and to collect specimens of hake and other organisms.
• Conduct nighttime (i.e., between sunset and sunrise) surface trawling to collect specimens of coastal pelagic fishes (CPS) and other organisms. These data will be used to classify observed backscatter to species and their size distributions. Nighttime sampling operations will conclude in time for the ship to resume running east-west acoustic transects by sunrise.
• Image fish using a portable X-radiograph machine for the purpose of target strength modeling and estimation.
• Collect a variety of other acoustic, biological, and oceanographic samples relevant to hake and sardine distributions. These data are vital for the surveys and assessments of hake and sardine.
• Continuously sample sea-surface temperature, salinity, and chlorophyll a using the ship’s thermosalinograph and fluorometer. These data will be used to estimate the physical oceanographic habitats for each target species.
• Continuously sample air temperature, barometric pressure, and wind speed and direction using the ship’s integrated weather station.
• Continuously sample pelagic fish eggs using the Continuous Underway Fish Egg Sampler (CUFES). The data will be used to estimate the distributions and abundances of spawning hake, anchovy, mackerel, and sardine.
• Sample profiles of temperature and salinity using either an underway conductivity-temperature-depth (CTD) system during the day or a standard CTD system with water-sampling rosette and other instruments at nighttime stations, as time allows.
• Sample plankton using a CalBOBL (CalCOFI Bongo) net at nighttime stations, as time allows. These data will be used to estimate the distribution and abundance of ichthyoplankton and zooplankton species.
• Continuously sample multi-frequency acoustic backscatter data using the ship’s Simrad ME70 multibeam echosounder system, synchronized and configured to not interfere with the EK60s.
• Optically verify CPS backscatter while underway conducting acoustic transects, using a towed stereo camera system.
• Optically observe fish behavior inside nighttime trawls using cameras and lights mounted inside the net.
Salutary Data is a boutique, B2B contact and company data provider that's committed to delivering high quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B Contacts (US only), along with over 4MM+ companies, and is updated regularly to ensure we have the most up-to-date information.
We can enrich your in-house data (CRM Enrichment, Lead Enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use-case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end-users.
What makes Salutary unique?
- We offer our clients a truly unique, one-stop aggregation of the best-of-breed quality data sources. Our supplier network consists of numerous, established high quality suppliers that are rigorously vetted.
- We leverage third party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts.
- We're reasonably priced and easy to work with.
Products:
- API Suite
- Web UI
- Full and Custom Data Feeds
Services:
- Data Enrichment - We assess the fill rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net new “look alike” prospects for your campaigns.
- ABM Match & Append - Send us your domain or other company related files, and we’ll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally throw in your suppression file to avoid any redundant records.
- Verification (“Cleaning/Hygiene”) Services - Address the 2% per month aging issue on contact records! We will identify duplicate records, contacts no longer at the company, rid your email hard bounces, and update/replace titles or phones. This is right up our alley and leverages our existing internal and external processes and systems.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Microsoft Kinect is a motion sensing device invented by Microsoft, mainly used for joystick-free games with the Microsoft Xbox gaming console [1]. Microsoft Kinect has two versions: Kinect 360, which was released in 2010, and Kinect One, which was released in 2013 [2].
This dataset is collected in the form of Robotics Operating System (ROS) bags that contain both RGB and depth images collected from the Kinect V2 sensor. This dataset is collected as a part of the research done in 3D Object Detection and Classification Using Microsoft Kinect and Deep Neural Networks master's degree thesis. This thesis is done as part of the master's degree program at Cairo University.
You can access all the source code released in this thesis using the following links:
This dataset contains three data files in the form of ROS bags. These data files contain 6 types of objects, where 5 of them are stationary and the other is moving between different positions along the scene. The 6 object types are:
Human (moving),
Chair (stationary),
Sofa (stationary),
TV Monitor (stationary),
Bottle (stationary),
Books (stationary).
Each of these objects can be found at a certain depth position; the depth positions are as follows:
Small TV: 2.43 m,
Large TV: 4.43 m,
Black Chair: 1.81 m,
White Chair (to the right): 2.14 m,
Sofa: 1.4 m,
Books: 1.96 m,
Bottle: 1.71 m,
Human Position #1: 1.05 m,
Human Position #2: 2.66 m,
Human Position #3: 3.81 m,
Human Position #4: 4.54 m.
Darknet is used in the system introduced in this thesis work to get the bounding box of each of these objects.
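A minimal inspection sketch for the bags, assuming a ROS 1 environment with the rosbag Python API; the bag file name and topic names are hypothetical and should be replaced with the ones actually recorded:

```python
import rosbag  # ROS 1 Python API

with rosbag.Bag("scene1.bag") as bag:                      # hypothetical file name
    print(bag.get_type_and_topic_info().topics.keys())     # list the recorded topics
    for topic, msg, t in bag.read_messages(
            topics=["/kinect2/qhd/image_color", "/kinect2/qhd/image_depth_rect"]):
        print(t.to_sec(), topic, msg.height, msg.width)
        break                                              # inspect only the first message
```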
[1] A. S. Sabale and Y. M. Vaidya, "Accuracy measurement of depth using Kinect sensor," in 2016 Conference on Advances in Signal Processing (CASP), pp. 155–159, June 2016.
[2] "Kinect Sensor." https://en.wikipedia.org/wiki/Kinect, 2012. [Online; accessed 28-August-2018].
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset includes some on-chain indicator information on a daily basis for Bitcoin. The indicators are:
Inflow Volume
IntoTheBlock has built a proprietary machine learning powered classifier to identify addresses of top centralized exchanges, including their deposit addresses, withdrawal addresses, hot wallets and cold wallets. With this classifier, IntoTheBlock can measure the total amount of a given crypto-asset flowing into exchanges and measures this in dollar and crypto terms. The result is the Inflow Volume indicator.
Outflow Volume
While Inflow Volume at times anticipates volatility, Outflow Volume is often more reactive. In other words, Outflow Volume often spikes following either a crash or a significant break-out as shown in the example above. This could potentially be interpreted as users going long and opting to hold their crypto outside centralized exchanges.
Total Flows
IntoTheBlock uses machine learning algorithms to identify centralized exchanges’ deposit and withdrawal addresses. Through this process, IntoTheBlock measures the total activity flowing in and out of centralized exchanges. The result is the Total Flows indicator, which is measured the following way:
Total Flows = Inflow Volume + Outflow Volume
Net Flows
The Net Flows indicator highlights trends of traders sending money in and out of exchanges. Recall that Net Flows are positive when more funds are entering than leaving exchanges. Therefore, we observe that positive Net Flows tend to coincide with periods following large increases in price (like LINK when it tripled between April and July) or confirmation of down-trends (as seen with LINK in late August).
Conversely, Net Flows are negative when a greater volume is being withdrawn from exchanges. This could be seen as a sign of accumulation (LINK in early August) or addresses buying back following large declines (LINK in early September).
While Net Flows also affect large cap crypto-assets, smaller cap tokens are more susceptible to large changes in prices deriving from exchange flows. This is simply a result of smaller caps requiring less capital in order to make market-moving trades. This is worth considering when using the Net Flows indicator to trade.
Net Flows = Inflow Volume - Outflow Volume
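A minimal sketch of the two formulas above, assuming pandas and illustrative column names:

```python
import pandas as pd

# Illustrative daily values, not real indicator data.
df = pd.DataFrame({
    "inflow_volume":  [120.0, 95.5, 210.3],
    "outflow_volume": [100.2, 130.0, 180.1],
})

df["total_flows"] = df["inflow_volume"] + df["outflow_volume"]
df["net_flows"] = df["inflow_volume"] - df["outflow_volume"]  # positive => more funds entering exchanges
print(df)
```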
Outflow Transaction Count
The Outflow Transaction Count indicator provides an indication of users withdrawing their funds from centralized exchanges, likely to store them in safer cold wallets. This is a valuable approximation of users going long and opting to hold their own funds. For this reason, outflows tend to spike as price crashes, as pointed out in the example above. While this can be the case on several occasions, natural fluctuations in exchanges’ flows can often have smaller spikes without regard to price action as well.
Inflow Transaction Count
As the name suggests, the Inflow Transaction Count indicator provides the number of incoming crypto transactions entering exchanges. While the Inflow Volume measures the aggregate dollar amount, which is influenced by whales’ transactions, the Inflow Transaction Count is a better approximation of the number of users sending funds into exchanges.
This indicator has also shown to rise along and anticipate periods of high volatility. For example, on September 1st, inflow transactions for Bitcoin hit a 3-month high preceding a decrease in price of 14% over the following 48 hours. While this pattern does tend to emerge, natural fluctuations in inflow transactions can also increase at times.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-aspect Integrated Migration Indicators (MIMI) dataset is the result of the process of gathering, embedding and combining traditional migration datasets, mostly from sources like Eurostat and the UNSD Demographic Statistics Database, and alternative types of data, which consist of multidisciplinary features and measures not typically employed in migration studies, such as the Facebook Social Connectedness Index (SCI). Its purpose is to exploit these novel types of data for nowcasting migration flows and stocks, studying integration of multiple sources and knowledge, and investigating migration drivers. The MIMI dataset is designed to have a unique pair of countries for each row. Each record contains country-to-country information about migration flows and stocks, their share, their strength of Facebook connectedness, and other features, such as corresponding populations, GDP, coordinates, NET migration, and many others.
Methodology. After having collected bilateral flow records about international human mobility by citizenship, residence and country of birth (available for both sexes and, in some cases, for different age groups), they have been merged together in order to obtain a unique dataset in which each ordered couple (country-of-origin, country-of-destination) appears once. To avoid duplicate couples, flow records have been selected by following this priority: first migration by citizenship, then migration by residence and lastly by country of birth. The integration process started by choosing, collecting and meaningfully including many other indicators that could be helpful for the dataset's final purpose mentioned above:
- International migration stocks (having a five-year range of measurement) for each couple of countries.
- Geographical features for each country: ISO3166 name and official name, ISO3166-1 alpha-2 and alpha-3 codes, continent code and name of belonging, latitude and longitude of the centroid, list of bordering countries, country area in square kilometres. Also, the following feature has been included for each pair of countries: geodesic distance (in kilometres) computed between their respective centroids.
- Non-bidirectional migration measures for each country: total number of immigrants and emigrants for each year, NET migration and NET migration rate in a five-year range.
- Other multidisciplinary indicators (cultural, social, anthropological, demographical, historical features) related to each country: religion (single one or list), yearly GDP at PPP, spoken language (or list of languages), yearly population stocks (and population densities if available), number of Facebook users, percentage of Facebook users, cultural indicators (PDI, IDV, MAS, UAI, LTO). Also, the following feature has been included for each pair of countries: the Facebook Social Connectedness Index.
Once traditional and non-traditional knowledge is gathered and integrated, we move to the pre-processing phase where we manage the data cleaning, preparation and transformation. Here our dataset was subjected to various computational standard processes and additionally reshaped into the final structure established by our design choices. The data quality assessment phase was one of the longest and most delicate, since many values were missing and this could have had a negative impact on the quality of the desired resulting knowledge.
The missing values have been integrated from additional sources such as The World Bank, World Population Review, Statista, DataHub and Wikipedia, and in some cases extracted from Python libraries such as PyPopulation, CountryInfo and PyCountry. The final dataset has the structure of a huge matrix having country couples as index (uniquely identified by coupling their ISO 3166-1 alpha-2 codes): it comprises 28725 entries and 485 columns.
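A minimal loading sketch, assuming pandas and a hypothetical CSV export of the matrix whose first two columns are the origin and destination ISO 3166-1 alpha-2 codes:

```python
import pandas as pd

mimi = pd.read_csv("mimi.csv", index_col=[0, 1])   # hypothetical file name and column layout
print(mimi.shape)                                  # expected (28725, 485)
print(mimi.loc[("IT", "DE")])                      # all indicators for the IT -> DE country pair
```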
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo repository contains all migration flow estimates associated with the paper "Deep learning four decades of human migration." Evaluation code, training data, trained neural networks, and smaller flow datasets are available in the main GitHub repository, which also provides detailed instructions on data sourcing. Due to file size limits, the larger datasets are archived here.
Data is available in both NetCDF (.nc) and CSV (.csv) formats. The NetCDF format is more compact and pre-indexed, making it suitable for large files. In Python, datasets can be opened as xarray.Dataset objects, enabling coordinate-based data selection.
Each dataset uses the following coordinate conventions:
The following data files are provided:
T summed over Birth ISO). Dimensions: Year, Origin ISO, Destination ISO
Additionally, two CSV files are provided for convenience:
imm: Total immigration flows
emi: Total emigration flows
net: Net migration
imm_pop: Total immigrant population (non-native-born)
emi_pop: Total emigrant population (living abroad)
mig_prev: Total origin-destination flows
mig_brth: Total birth-destination flows, where Origin ISO reflects place of birth
Each dataset includes a mean variable (mean estimate) and a std variable (standard deviation of the estimate).
An ISO3 conversion table is also provided.
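A minimal selection sketch, assuming xarray and that the .nc files expose the coordinates and the mean/std variables described above (the file name and country codes are illustrative):

```python
import xarray as xr

ds = xr.open_dataset("mig_prev.nc")   # hypothetical file name
sel = {"Year": 2015, "Origin ISO": "MEX", "Destination ISO": "USA"}
print(float(ds["mean"].sel(sel)), float(ds["std"].sel(sel)))
```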
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Acknowledgement and Disclaimers
These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project. RAILS has received funding from the Shift2Rail Joint Undertaking (JU) under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.
The information and views set out in this description are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this dataset. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.
This "dataset" has been created for scientific purposes only to study the potentials of Deep Learning (DL) approaches when used to analyse Video Data in order to detect possible obstacles on rail tracks and thus avoid collisions. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.
Objectives of the Study
RAILS defined some pilot case studies to develop Proofs-of-Concept (PoCs), which are conceived as benchmarks, with the aim of providing insight towards the definition of technology roadmaps that could support future research and/or the deployment of AI applications in the rail sector. In this context, the main objectives of the specific PoC "Vision-Based Obstacle Detection on Rail Tracks" were to investigate: i) solutions for the generation of synthetic data, suitable for the training of DL models; and ii) the potential of DL applications when it comes to detecting any kind of obstacles on rail tracks while exploiting video data from a single RGB camera.
A Brief Overview of the Approach
A multi-modular approach has been proposed to achieve the objectives mentioned above. The resulting architecture includes the following modules:
The Rails Detection Module (RDM) detects rail tracks. The output of the RDM is used by the ODM and ADM.
The Object Detection Module (ODM) detects obstacles whose type is known in advance.
The Anomaly Detection Module (ADM) identifies any possible anomaly on rail tracks. These include obstacles whose type is not known in advance.
The Obstacle Detection Module merges the outputs from the ODM and the ADM.
The Distance Estimation Module estimates the distance of objects and anomalies from the train.
The research was specifically oriented at implementing the RDM-ADM pipeline. Indeed, the object detection approaches that would be used to implement the ODM have been widely investigated by the research community; instead, to the best of our knowledge, limited work has been done in the rail field in the context of anomaly detection. The RDM has been realised by adopting a Semantic Segmentation approach based on U-Net, while, to develop the ADM, a Vector-Quantized Variational Autoencoder trained in Unsupervised mode was leveraged. Further details can be found in the RAILS "Deliverable D2.3: WP2 Report on experimentation, analysis, and discussion of results".
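A hypothetical sketch of how the RDM and ADM could be chained, with placeholder models; the reconstruction-error criterion is one common way to use an autoencoder for anomaly detection and is shown only as an illustration, not as the project's exact decision rule:

```python
import torch

def detect_anomaly(frame, unet, vqvae, threshold):
    """frame: (1, 3, H, W) float tensor; unet and vqvae are stand-ins for the trained models."""
    with torch.no_grad():
        rail_mask = (unet(frame) > 0.5).float()     # RDM: keep only rail-track pixels
        masked = frame * rail_mask
        reconstruction = vqvae(masked)              # ADM: reconstruct the rail-only frame
        error = torch.mean((masked - reconstruction) ** 2)
    return error.item() > threshold                 # large error => possible obstacle/anomaly
```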
Steps to implement the RDM-ADM pipeline and description of shared data
The following list reports all the steps that have been performed to implement the RDM-ADM pipeline; the words in bold-italic refer to the files that are shared within this dataset:
A Railway Scenario was generated in MathWorks' RoadRunner.
A video (FreeTrackVideo) was recorded by simulating an RGB camera mounted in front of the train; no obstacles on rail tracks were considered in this phase.
2000 frames (FreeTrack2KFrames) were extracted from the aforementioned video. The video contains 4143 frames; however, only 2000 (every other frame, starting from the first one) were taken into account due to training time and GPU RAM constraints.
Only 10% of the 2000 frames (i.e., 200 frames, one frame every 10) were manually labelled using LabelMe; these frames were then subdivided into training and validation sets (InitialLabelledSet).
Then, a Semi-Automatic labelling algorithm was developed by leveraging self-training and transfer learning. This algorithm made it possible to label all the FreeTrack2KFrames starting from the InitialLabelledSet (a sketch of this self-training loop is given after this list). The resulting labels can be found in FreeTrack2KLabels.
Data Augmentation was then performed in order to introduce some randomness into the dataset (see the augmentation sketch after this list). Because of the same time and RAM constraints mentioned above, the FreeTrack2KFrames set was reduced further: 1600 frames were selected among the aforementioned 2000, and then 5 transformations (Bright, Dark, Rain, Shadow, and Sun Flare) were applied to obtain the dataset (FreeTrack16TrainSet, FreeTrack16ValSet, FreeTrack16TestSet) that was used to train, validate, and test the RDM.
Once the RDM was trained, the FreeTrackVideo was processed to obtain the masked frames that were then used to build the dataset(s) to train, validate, and test the ADM. The ADM was studied by considering two different datasets: the Non-Anomaly Dataset (NAD), which contains all the frames of the FreeTrackVideo once processed by the RDM; and the Augmented Non-Anomaly Dataset (A-NAD), which contains 9000 frames, 1500 of which were extracted from the NAD, while the remaining 7500 were obtained by applying the same transformations mentioned above.
Lastly, when both the RDM and the ADM were trained, the performance of the whole RDM-ADM pipeline was tested on the WithCarVideo, which depicts the same scenario as the FreeTrackVideo but also includes a car lying on the rail tracks (i.e., an obstacle).
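The semi-automatic labelling step mentioned above is essentially a self-training loop: a segmentation model is trained on the small hand-labelled set, used to predict masks for the remaining frames, and the most confident predictions are promoted to labels for the next round. The sketch below only illustrates this loop; the function names, the confidence measure, the threshold, and the stopping rule are placeholders, not the RAILS implementation.

```python
# Self-training sketch for semi-automatic labelling (placeholders throughout).

def train_segmentation_model(labelled):
    """Placeholder: fine-tune a segmentation model (e.g. a pretrained U-Net)
    on the currently labelled frames and return it."""
    return object()

def predict_mask_with_confidence(model, frame_id):
    """Placeholder: return (predicted_mask, confidence in [0, 1]) for a frame."""
    return None, 0.0

def semi_automatic_labelling(initial_labelled, unlabelled, threshold=0.9, rounds=5):
    labelled = dict(initial_labelled)   # frame_id -> mask
    remaining = set(unlabelled)         # frame ids still without a mask
    for _ in range(rounds):
        model = train_segmentation_model(labelled)
        promoted = {}
        for frame_id in remaining:
            mask, conf = predict_mask_with_confidence(model, frame_id)
            if conf >= threshold:
                promoted[frame_id] = mask   # accept confident predictions as labels
        if not promoted:
            break                           # nothing new to promote: stop early
        labelled.update(promoted)
        remaining.difference_update(promoted)
    return labelled
```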
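The five transformations used in the augmentation step (Bright, Dark, Rain, Shadow, and Sun Flare) are the kind of photometric and weather effects offered by common augmentation libraries. The following is a minimal sketch using albumentations; the specific transforms, parameters, and file names are illustrative assumptions, not the exact settings used to build FreeTrack16TrainSet/ValSet/TestSet.

```python
# Illustrative augmentation sketch (not the exact RAILS parameters).
# Each transform is applied separately, so one input frame yields several
# augmented variants, as described in the augmentation step above.
import cv2
import albumentations as A

transforms = {
    "bright": A.RandomBrightnessContrast(brightness_limit=(0.2, 0.4), contrast_limit=0, p=1.0),
    "dark": A.RandomBrightnessContrast(brightness_limit=(-0.4, -0.2), contrast_limit=0, p=1.0),
    "rain": A.RandomRain(p=1.0),
    "shadow": A.RandomShadow(p=1.0),
    "sun_flare": A.RandomSunFlare(p=1.0),
}

frame = cv2.imread("frame_0000.png")  # hypothetical file name
for name, transform in transforms.items():
    augmented = transform(image=frame)["image"]
    cv2.imwrite(f"frame_0000_{name}.png", augmented)
```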