Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.
This dataset takes the data from the original dataset and converts the images to NumPy arrays to make processing the images easier.
10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9, and '0' has label 10. 73257 digits for training and 26032 digits for testing. The data comes in two formats:
1. Original images with character level bounding boxes.
2. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).
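A minimal sketch of working with the label convention above, assuming the arrays have already been loaded from this dataset (the file names below are hypothetical placeholders, not necessarily the actual file names shipped here):
import numpy as np

# Hypothetical file names; substitute the actual array files provided with this dataset.
X = np.load("svhn_train_images.npy")   # e.g. shape (73257, 32, 32, 3), uint8
y = np.load("svhn_train_labels.npy")   # labels in 1..10, where 10 means digit '0'

# Map the SVHN convention (digit '0' stored as label 10) to plain digits 0-9.
y_digits = np.where(y == 10, 0, y)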
All the credit to:
http://ufldl.stanford.edu/housenumbers/
and,
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr

raw = zarr.open(<path_to_zarr_file>, mode='r', path="volumes/raw")
seg = zarr.open(<path_to_zarr_file>, mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.
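For example, converting the lazily loaded arrays to in-memory NumPy arrays (a minimal sketch, assuming raw and seg were opened as above) can be done with:
import numpy as np

raw_np = np.asarray(raw)  # materializes the full CZYX image volume in memory
seg_np = np.asarray(seg)  # materializes all instance mask channels in memory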
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt, application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results, please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year = 2024,
  eprint = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
This dataset was created by Tanay Mehta.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised and split data into training/validation/testing sets is also provided with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
This dataset was constructed by running the ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, a naïve prediction (persistence), and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of predictions. It is intended for a quick reproduction of the figures and statistical analyses presented in the paper introducing DustNet.
*datasets are NumPy arrays (v1.23) created in Python v3.8.18.
**all ML models were created with Keras in Python v3.10.10.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Image Impeccable
Dataset Description
This data was produced by ThinkOnward for the Image Impeccable Challenge, using a synthetic seismic dataset generator called Synthoseis.
Created by: Mike McIntire and Jesse Pisel
License: CC 4.0
Uses
How to generate a dataset
This dataset is provided as paired noisy and clean seismic volumes. Follow the following steps to load the data into NumPy volumes: import pandas as pd import numpy as… See the full description on the dataset page: https://huggingface.co/datasets/thinkonward/image-impeccable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.
Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg
This dataset has been generated in order to investigate how particles (i.e. oil from a spill) behave on the surface when the ocean is dominated by convection. We found that there is a preferential concentration mechanism that dominates the surface signature. The dataset is submitted as a zip file with a pair of files for each figure. Each pair contains the data (in numpy-npz format) and a small python script to read the data.
NumPy archive files can be opened using Python and NumPy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Read me file for the data repository
*******************************************************************************
This repository has raw data for the publication "Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain". We arrange the data following the figure in which it first appeared. For all electrical transfer measurements, we provide the up-sweep and down-sweep data, with voltage units in V and conductance units in S. All Raman modes have units of cm^-1.
*******************************************************************************
How to use this dataset
All data in this dataset is stored in binary NumPy array format as .npy files. To read a .npy file, use the NumPy module of the Python language and the np.load() command. Example: suppose the filename is example_data.npy. To load it into a Python program, open a Jupyter notebook, or in the Python program, run: import numpy as np; data = np.load("example_data.npy"). The example file is then stored in the data object.
*******************************************************************************
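For convenience, here is the loading example from the readme above written out as a runnable snippet (example_data.npy is the readme's own placeholder name):
import numpy as np

# Load one of the .npy files from this repository ("example_data.npy" is a placeholder name).
data = np.load("example_data.npy")
print(data.shape, data.dtype)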
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary
This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.
They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD
How to Get Started
All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, if you put it in a folder called "data" next to the Python script, is:
import numpy as np
hsi_array = np.load("data/beach_hsi.npy")
n_pixels, n_lines, n_bands = hsi_array.shape
print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")
mask_array = np.load("data/beach_mask.npy")
m_pixels, m_lines = mask_array.shape
print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
Citing the Datasets
If you use any of these datasets, please cite the following paper:
@article{garske2024erx,
  title = {ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
  author = {Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
  journal = {arXiv preprint arXiv:2408.14947},
  year = {2024},
}
If you use the beach dataset please cite the following paper as well (original source):
@article{mao2022openhsi,
  title = {OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
  author = {Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
  journal = {Remote Sensing},
  volume = {14},
  number = {9},
  pages = {2244},
  year = {2022},
  publisher = {MDPI}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This supports the manuscript titled "Stratosphere-Troposphere Exchange of Water Vapor Based on Observations and Reanalyses" submitted to *Geophysical Research Letters*.
It includes:
- Python and NCL scripts used for processing satellite observational data from COSMIC, CloudSat, MLS, and reanalyses data.
- Post-processed Reanalysis data from ERA5 and MERRA-2 and observational data
- Final data used for figures and statistical analyses.
- Jupyter notebooks for reproducing Figures 1–3, Table 1 and supplementary figures.
All data are stored in NetCDF or NumPy `.npz` format.
Please refer to the included `README.md` for detailed instructions on data structure, variable definitions, and how to reproduce the results.
This is a single NumPy array with all the Optiver data joined together. It also has some of the features from this notebook. It's designed to be memory-mapped so that you can read small pieces at once.
This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is fp16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the max seconds_in_bucket and 27 is the number of columns.
Add the dataset to your notebook and then
import numpy as np
ntimeids=3830
nstocks=112
ncolumns = 27
nseq = 600
arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array', mode='r', dtype=np.float16, shape=(ntimeids, nstocks, nseq, ncolumns))
There are gaps in the stock ids and time ids, which doesn't work great with an array format. So we have time and stock indexes as well (_ix suffix instead of _id). To calculate these:
import numpy as np
import pandas as pd
targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')
ntimeids = targets.time_id.nunique()
stock_ids = list(sorted(targets.stock_id.unique()))
timeids = sorted(targets.time_id.unique())
timeid_to_ix = {time_id:i for i,time_id in enumerate(timeids)}
stock_id_to_ix = {stock_id:i for i,stock_id in enumerate(stock_ids)}
So to get the data for stock_id 13 on time_id 146 you'd do
stock_ix = stock_id_to_ix[13]
time_ix = timeid_to_ix[146]
arr[time_ix,stock_ix]
Notice that the third dimension is of size 600 (the max number of points for a given time_ix, stock_ix). Some of these rows will be empty.
To truncate a single stock's data, do
max_seq_ix = (arr[time_ix,stock_ix,:,-1]>0).cumsum().max()
arr[time_ix,stock_ix,:max_seq_ix,]
There are 27 columns in the last dimension; these are:
['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
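A minimal sketch for pulling one named column out of the array (the column list above is the ground truth; the helper below is just an illustration and reuses arr, time_ix, stock_ix and max_seq_ix from the snippets above):
# Map each column name to its index in the last dimension.
columns = ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2',
           'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2',
           'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance',
           'price_spread', 'bid_spread', 'ask_spread', 'total_volume',
           'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y',
           'log_return_trade', 'target']
col_ix = {name: i for i, name in enumerate(columns)}

# e.g. the wap1 series for the selected stock/time pair, truncated to its real length
wap1 = arr[time_ix, stock_ix, :max_seq_ix, col_ix['wap1']]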
In my opinion, it was horrible to import these images into Kaggle the right way. The way I was used to is the Keras helper cifar10.load_data(), but that does not work on Kaggle.
That is why I downloaded each x_train,y_train, x_test, y_test, packed them into a compressed numpy array, and uploaded them here.
How you would import them using Keras: (x_train, y_train), (x_test, y_test) = cifar10.load_data()
How you can import them now:
import numpy as np
data = np.load("/kaggle/input/cifar10-keras-files-cifar10load-data/cifar-10.npz")
filenames = ["x_train","y_train","x_test","y_test"]
nps = []
for filename in filenames:
nps.append(data[filename])
x_train,y_train,x_test,y_test = nps
Further information regarding the dataset: https://www.cs.toronto.edu/~kriz/cifar.html
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
### Context
EDA using numpy and pandas
In this task, I have to predict what factors make an app perform well, whether it's size, price, category, or multiple factors together, and what makes an app rank at the top of the Google Play Store.
Column description:
App: name of the application
Category: category of the application
Rating: rating of the application
Reviews: reviews of the application
Size: size of the application
Installs: how many users installed the application
Type: type of the application
Price: price of the application
Content Rating: rating of the content of the application
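A minimal sketch of loading this table with pandas (the CSV filename is a hypothetical placeholder, not necessarily what the dataset ships):
import pandas as pd

# Hypothetical filename; use the CSV actually provided with this dataset.
df = pd.read_csv("googleplaystore.csv")

# Quick look at the columns described above.
print(df[["App", "Category", "Rating", "Reviews", "Size", "Installs", "Type", "Price"]].head())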
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
def unpickle(file):
import cPickle
with open(file, 'rb') as fo:
dict = cPickle.load(fo)
return dict
And a python3 version:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
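As a brief illustration of the row layout just described (a sketch, not part of the official loader), a single row of data can be turned back into a 32x32 RGB image like this:
import numpy as np

batch = unpickle("data_batch_1")                   # dictionary described above; keys are bytes with the python3 loader
row = batch[b"data"][0]                            # 3072 uint8 values: 1024 red, then 1024 green, then 1024 blue
img = row.reshape(3, 32, 32).transpose(1, 2, 0)    # -> (32, 32, 3) image; each channel is stored row-major
label = batch[b"labels"][0]                        # integer in 0-9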
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
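A small sketch of reading one binary batch with NumPy, based on the 3073-byte record layout described above (not an official loader):
import numpy as np

# Each record is 1 label byte followed by 3072 pixel bytes (3073 bytes total).
records = np.fromfile("data_batch_1.bin", dtype=np.uint8).reshape(-1, 3073)
labels = records[:, 0]                                                 # shape (10000,)
images = records[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)   # (10000, 32, 32, 3)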
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.
The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.
Each skeleton sequence was processed by:
Two .npz files are provided, following the standard evaluation protocols:
NTU60_CS.npz → Cross-Subject split
NTU60_CV.npz → Cross-View split
Each file contains:
x_train → Training data, shape (N_train, 300, 150)
y_train → Training labels, shape (N_train, 60) (one-hot)
x_test → Testing data, shape (N_test, 300, 150)
y_test → Testing labels, shape (N_test, 60) (one-hot)
If a sequence has only 1 person, the second person’s features are zero-filled with (0,0,0).
These .npz files can be directly loaded in PyTorch or NumPy-based pipelines.
They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.
Example:
import numpy as np
data = np.load("NTU60_CS.npz")
x_train, y_train = data["x_train"], data["y_train"]
print(x_train.shape) # (N_train, 300, 150)
print(y_train.shape) # (N_train, 60)
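A hedged sketch of unpacking the flat 150-value feature axis, assuming the common NTU convention of 2 persons × 25 joints × 3 coordinates (this matches the zero-filled second person described above, but the exact ordering is an assumption, not stated explicitly here):
# Assumed layout: 150 = 2 persons x 25 joints x (x, y, z) coordinates.
skeletons = x_train.reshape(x_train.shape[0], 300, 2, 25, 3)
print(skeletons.shape)  # (N_train, 300, 2, 25, 3)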
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains pre-processed time-series data for a binary classification task to determine whether a drone is healthy or faulty based on its motion data. The data has been windowed and is ready for use with sequence-based deep learning models like LSTMs, GRUs, or 1D CNNs.
Source Data:
The data is derived from the "DronePropA: Motion Trajectories Dataset for Defective Drones" by Ismail, Elshaar, et al. The original dataset consists of 130 .mat files, each representing a single flight experiment.
Preprocessing Steps:
The original .mat files have been processed to create a single, model-ready .npz file. The following steps were applied:
1. Feature Extraction: For each of the 130 flights, 12 specific time-series features were extracted, focusing on the drone's core motion dynamics.
2. Labeling: Each flight was labeled as healthy (0) or faulty (1) based on the file naming convention described in the source paper.
3. Windowing: The time-series data from each flight was segmented into overlapping windows. Each window is 200 time-steps long with a 50% overlap between consecutive windows (see the sketch after this list).
4. Aggregation: All windows from all flights were stacked into a single dataset.
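A minimal sketch of the windowing step referenced above (200 time-steps per window, 50% overlap); the function and variable names here are illustrative placeholders, not the actual preprocessing code:
import numpy as np

def make_windows(flight: np.ndarray, window: int = 200, overlap: float = 0.5) -> np.ndarray:
    # Segment a (time_steps, 12) flight array into overlapping (window, 12) windows.
    step = int(window * (1 - overlap))   # 100 time-steps between consecutive window starts
    starts = range(0, len(flight) - window + 1, step)
    return np.stack([flight[s:s + window] for s in starts])

# windows = make_windows(flight_features)   # -> (num_windows, 200, 12)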
The data is contained in a single compressed NumPy archive file: proceed_data.npz. This file contains two arrays: X and y.
X: A 3-dimensional NumPy array containing the feature data, with shape (num_windows, 200, 12).
y: A 1-dimensional NumPy array containing the corresponding labels for each window in X, with shape (num_windows,); 0 for a healthy window, 1 for a faulty window.
You can load the data easily using NumPy.
import numpy as np
# Load the dataset
data = np.load('proceed_data.npz')
# Extract the features and labels
X = data['X']
y = data['y']
print("Data loaded successfully!")
print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
## Dataset Details
**Features:**
The 12 features in the third dimension of the `X` array are in the following order:
1. Position X (meters)
2. Position Y (meters)
3. Position Z (meters)
4. Roll (radians)
5. Pitch (radians)
6. Yaw (radians)
7. Roll Rate (rad/s)
8. Pitch Rate (rad/s)
9. Yaw Rate (rad/s)
10. Acceleration X (m/s²)
11. Acceleration Y (m/s²)
12. Acceleration Z (m/s²)
**Labels:**
The labels in the `y` array are defined as:
* `0`: Healthy
* `1`: Faulty
## Citation
If you use this dataset, please cite the original authors of the DronePropA dataset.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The dataset contains housing market information for different areas of London over time. It includes details such as average house prices, the number of houses sold, and crime statistics. The data spans multiple years and is organized by date and geographic area.
This data is available as a CSV file. We are going to analyze this data set using the Pandas DataFrame.
Using this dataset, we answered multiple questions with Python in our project; a short pandas sketch for the first few questions appears after the list below.
Q. 1) Convert the Datatype of 'Date' column to Date-Time format.
Q. 2.A) Add a new column ''year'' in the dataframe, which contains years only.
Q. 2.B) Add a new column ''month'' as 2nd column in the dataframe, which contains month only.
Q. 3) Remove the columns 'year' and 'month' from the dataframe.
Q. 4) Show all the records where 'No. of Crimes' is 0. And, how many such records are there?
Q. 5) What is the maximum & minimum 'average_price' per year in England?
Q. 6) What is the maximum & minimum No. of Crimes recorded per area?
Q. 7) Show the total count of records of each area, where average price is less than 100000.
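As referenced above, here is a minimal pandas sketch for the first few questions (the CSV filename and exact column spellings are placeholders based on the feature list further below):
import pandas as pd

# Placeholder filename; use the CSV shipped with this dataset.
df = pd.read_csv("london_housing.csv")

# Q1: convert the 'Date' column to datetime.
df["Date"] = pd.to_datetime(df["Date"])

# Q2: add 'year' and 'month' columns derived from 'Date'.
df["year"] = df["Date"].dt.year
df.insert(1, "month", df["Date"].dt.month)   # insert 'month' as the 2nd column

# Q3: remove the helper columns again.
df = df.drop(columns=["year", "month"])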
Enrol in our Udemy courses:
1. Python Data Analytics Projects - https://www.udemy.com/course/bigdata-analysis-python/?referralCode=F75B5F25D61BD4E5F161
2. Python For Data Science - https://www.udemy.com/course/python-for-data-science-real-time-exercises/?referralCode=9C91F0B8A3F0EB67FE67
3. Numpy For Data Science - https://www.udemy.com/course/python-numpy-exercises/?referralCode=FF9EDB87794FED46CBDF
These are the main features/columns available in the dataset:
1) Date – The month and year when the data was recorded.
2) Area – The London borough or area for which the housing and crime data is reported.
3) Average_price – The average house price in the given area during the specified month.
4) Code – The unique area code (e.g., government statistical code) corresponding to each borough or region.
5) Houses_sold – The number of houses sold in the given area during the specified month.
6) No_of_crimes – The number of crimes recorded in the given area during the specified month.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract
This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction:
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.
Conclusions
This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
File:
- Contains 1000 files with 100 prime numbers in each file
- Format *.dat
Data Format:
- Python NumPy array
- Float64
Example (how to use):
- numpy.loadtxt([filename])
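A minimal sketch of reading one of these files (the filename below is a hypothetical placeholder; substitute an actual .dat filename from the archive):
import numpy as np

# Hypothetical filename; any of the 1000 .dat files is read the same way.
primes = np.loadtxt("primes_0001.dat")   # float64 array of 100 prime numbers
print(primes.shape, primes.dtype)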