CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data used in this paper come from the 16th data release of SDSS (SDSS-DR16). SDSS-DR16 contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data were downloaded from the official SDSS website; specifically, they were obtained through the SkyServer API by running SQL queries in the CasJobs sub-site. The SDSS photometric table PhotoObj can only classify observed sources as point sources or extended (surface) sources, whereas spectra allow the target sources to be classified more precisely as galaxies, stars and quasars. We therefore obtain calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj catalog, and retrieve the target position information (right ascension and declination). The calibrated sources allow the three classes to be told apart precisely and quickly. Each calibrated source is labeled through the parameter "Class" as "galaxy", "star", or "quasar".
In this paper, the observed sky areas 3462, 3478, 3530 and 4 other areas in SDSS-DR16 are selected as experimental data, because a large number of sources can be obtained in these areas, providing rich sample data for the experiment. For example, there are 9891 sources in area 3462, including 2790 galaxy sources, 2378 stellar sources and 4723 quasar sources. There are 3862 sources in area 3478, including 1759 galaxy sources, 577 stellar sources and 1526 quasar sources. FITS is a commonly used data format in the astronomical community. By cross-matching the catalog with the FITS files of the local sky region, we obtained images in the 5 bands u, g, r, i and z for 12499 galaxy sources, 16914 quasar sources and 16908 star sources as training and testing data.
1.1 Image Synthesis
SDSS photometric data include images in five bands, u, g, r, i and z, and each band is stored separately as a single-band FITS file. Images of different bands contain different information. Since the g, r and i bands contain more feature information and less noise, astronomical researchers typically use the g, r and i bands, corresponding to the R, G and B channels of the image, to synthesize photometric images. Generally, different bands cannot be combined directly: if the three bands are simply stacked, the images of the different bands may not be aligned. Therefore, this paper adopts the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images from the g, r and i bands. This method effectively avoids the misalignment between bands. Each photometric image in this paper is 2048×1489 pixels.
1.2 Data tailoring
The target images are first cropped. Cropping can be done with image segmentation tools; this paper implements it in Python. During cropping, we convert the right ascension and declination of each source in the catalog into pixel coordinates on the photometric image through the coordinate conversion formula, and use the pixel coordinates to locate the source. These coordinates are taken as the center point, and a rectangular box is cut around them. We found that the input image size affects the experimental results. Therefore, according to the target size of the sources, we selected three cutting sizes: 40×40, 60×60 and 80×80. Through experiment and analysis, we find that the convolutional neural network learns better and reaches higher accuracy on the small image size. In the end, we chose to crop the extended-source galaxies and the point-source quasars and stars to 40×40.
1.3 Division of training and test data
For the algorithm to achieve accurate recognition performance, enough image samples are needed. The selection of the training set, validation set and test set is an important factor affecting the final recognition accuracy. In this paper, the training set, validation set and test set are divided in the ratio 8:1:1. The validation set is used to tune the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning information. The total sample size is 34,000 source images, including 11543 galaxy sources, 11967 star sources, and 10490 quasar sources.
1.4 Data preprocessing
In this experiment, the training set and test set are used as the training and test input of the algorithm after data preprocessing. The quantity and quality of the data largely determine the recognition performance of the algorithm. The training set and the test set are preprocessed differently. For the training set, we first apply vertical flips, horizontal flips and scaling to the cropped images to enrich the data samples and enhance the generalization ability of the algorithm. Since the features of a celestial source are invariant under flipping, the labels of galaxies, stars and quasars do not change after these transformations. For the test set, preprocessing is simpler: we only scale the input image and feed the result to the algorithm.
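The cropping, 8:1:1 split and flip augmentation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (`crop_centered`, `split_indices`, `flip_augment`), the random seed and the synthetic frame are all hypothetical, the pixel coordinates are assumed to have already been computed from right ascension and declination, and the scaling step is left out because the source does not specify it.

```python
import numpy as np

# Frame size quoted in the text: 2048 x 1489 pixels (columns x rows), 3 channels.
FRAME_SHAPE = (1489, 2048, 3)


def crop_centered(image, x, y, size=40):
    """Cut a size x size box centered on pixel (x, y).

    Returns None when the box would extend past the image border
    (an assumption; the source does not say how edge cases are handled).
    """
    half = size // 2
    left = int(round(x)) - half
    top = int(round(y)) - half
    rows, cols = image.shape[:2]
    if left < 0 or top < 0 or left + size > cols or top + size > rows:
        return None
    return image[top:top + size, left:left + size]


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle sample indices and split them in the given ratio."""
    order = np.random.default_rng(seed).permutation(n)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])


def flip_augment(cutout):
    """Return the cutout plus its vertical and horizontal flips."""
    return [cutout, np.flipud(cutout), np.fliplr(cutout)]


# Synthetic stand-in for one synthesized RGB frame.
frame = np.random.default_rng(0).random(FRAME_SHAPE, dtype=np.float32)

cutout = crop_centered(frame, x=1000.2, y=700.7, size=40)
augmented = flip_augment(cutout)
train_idx, val_idx, test_idx = split_indices(34000)
print(cutout.shape, len(augmented), len(train_idx), len(val_idx), len(test_idx))
```

Only the training cutouts would be passed through `flip_augment`; the class label of each flipped copy is the same as that of the original cutout.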
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The UniverseMachine is a self-consistent empirical model of galaxy formation in dark matter halos. It is constrained via observed galaxy stellar mass functions, star formation rates, clustering, luminosity functions, and quenched fractions. This dataset includes derived constraints on galaxy-halo relationships, star formation histories, merger histories, and predicted observables.
Full mock catalogs with galaxy properties are available here.
For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data used in this tutorial are a subset of the data published previously in Training material for the course "Exome analysis with GALAXY". Credit for uploading the original data goes to Paolo Uva and Gianmauro Cuccuru!
Specifically, you may need the following datasets for following the tutorial:
Raw sequencing reads
Premapped sequencing reads
Reference sequence (human chromosome 8)
If you would just like to play with GEMINI rather than work through the full tutorial, you'll find below a prebuilt GEMINI database (for GEMINI version 0.20.1) for the family trio. You can start exploring this database without having to run GEMINI load and, in fact, without having to install GEMINI's bundled annotation data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a curated collection of galaxy observations from the Sloan Digital Sky Survey (SDSS). It features photometric and spectroscopic data for 100 galaxies, specifically selected to cover a range of redshifts from 0 to 0.4. The dataset includes the following key parameters for each galaxy:
redshift: Redshift of the galaxy, with its error in redshift_error.
objid: Unique identifier for the photometric object.
specObjID: Unique identifier for the spectroscopic object.
ra: Right ascension in decimal degrees.
dec: Declination in decimal degrees.
class: Classification of the object, all marked as 'GALAXY'.
This dataset is intended for use in astronomical research and education, particularly in studies involving galaxy properties and distribution, cosmology, and machine learning applications such as redshift prediction models. The data is well-suited for developing and testing predictive models that estimate redshifts from photometric data, aiding in the expansion of accessible astronomical analysis tools.
The data was extracted using SQL queries against the public SDSS DR16 database, ensuring accuracy and relevance in current astronomical research contexts.
The dataset is made available under a CC0 license to promote open scientific research and collaboration within the astronomical community and beyond.
Bulk data of human pancreas
The dataset from Fadista et al. (2014) contains raw read counts from bulk RNA-seq of human pancreatic islets, collected to study glucose metabolism in healthy and hyper-hypoglycemic conditions. For the purpose of this vignette, the dataset is pre-processed and made available on the data download page. In addition to read counts, this dataset also contains HbA1c levels, BMI, gender and age information for each subject.
Single Cell Data of Human Pancreas
The single-cell data are from Segerstolpe et al. (2016) and contain read counts for 25453 genes across 2209 cells. Here we only include the 1097 cells from 6 healthy subjects. The read counts are available on the data download page, in the form of an ExpressionSet. Another single-cell dataset is from Xin et al. (2016), which has 39849 genes and 1492 cells. The read counts are available on the data download page, in the form of an ExpressionSet.
The deconvolution of 89 subjects from Fadista et al. (2014) is performed with the bulk data GSE50244.bulk.eset and the single-cell reference EMTAB.eset. We constrained our estimation to 6 major cell types: alpha, beta, delta, gamma, acinar and ductal, which make up over 90% of the whole islet.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Galaxy Zoo team regularly receives requests for subject images for various versions of Galaxy Zoo, in order to facilitate other investigations, e.g. machine learning projects. This repository is an updated attempt to provide those in a way that is useful to the wider community.
There are 243,434 images in total. This is off by about 0.08% from the total count in the tables; it's not clear what the cause of the discrepancy is.
The images are available in the file images_gz2.
The most recent and reliable source for morphology measurements is "GZ2 - Table 1 - Normal-depth sample with new debiasing method – CSV" (from Hart et al. 2016), which is available at data.galaxyzoo.org. To cross-reference the images with Table 1, this sample includes another CSV table (gz2_filename_mapping.csv) which contains three columns and 355,990 rows. The columns are:
They are the "original" sample of subject images in Galaxy Zoo 2 (Willett et al. 2013, MNRAS, 435, 2835, DOI: 10.1093/mnras/stt1458) as identified in Table 1 of Willett et al. and also in Hart et al. (2016, MNRAS, 461, 3663, DOI: 10.1093/mnras/stw1588).
I want to know if it's possible to cluster the images into the galaxy shape types of the Hubble - de Vaucouleurs Galaxy Morphology Diagram.
If these three are not enough and you want to improve your notebook, it is possible to add:
I didn't add these to the first clusters because, depending on the viewing angle of the galaxy, some lenticulars may look like ellipticals or spirals, the arms of spiral galaxies are not always visible, and it is hard to tell whether a galaxy is tiny or big from a single photograph with nothing to compare it to.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To date, genome assemblies of non-model organisms are usually not at chromosomal level and are highly fragmented. This fragmentation is recognized to be, in part, the result of poor assembly of transposable element (TE) copies, which makes them harder to detect and annotate.
In this context, we designed a new bioinformatics pipeline named PiRATE to detect, classify and annotate the TEs of non-model organisms. PiRATE combines multiple analysis packages representing all the major approaches for TE detection. The goal is to promote the detection of complete TE sequences of every TE family. Detecting complete TE sequences, bearing recognizable conserved domains or specific motifs, facilitates the classification step. The classification step of PiRATE has been optimized for algal genomes.
All the tools used by PiRATE are automated in a stand-alone Galaxy. This PiRATE-Galaxy can be used through a virtual machine, which can be downloaded below.
This PiRATE-Galaxy is a suitable and flexible platform to study TEs in the genome of any organism.
You can find a tutorial below.
Please contact us if you have any issues or comments: berthelier.j [at] laposte.net or gregory.carrier [at] ifremer.fr, or you can leave a message on GitHub: https://github.com/jberthelier/pirate/issues
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data here is a copy of the corresponding SRR records in the NCBI SRA. The duplication serves a dual purpose:
The online gambling market is expected to grow at a CAGR of 11% during the forecast period. This market growth can be attributed to various factors including rising popularity of the freemium model.
The online gambling market report offers several other valuable insights such as:
CAGR of the market during the forecast period 2020-2024
Detailed information on factors that will drive online gambling market growth during the next five years
Precise estimation of the online gambling market size and its contribution to the parent market
Accurate predictions on upcoming trends and changes in consumer behavior
The growth of the online gambling market industry across APAC, Europe, MEA, North America, and South America
A thorough analysis of the market’s competitive landscape and detailed information on vendors
Comprehensive details of factors that will challenge the growth of online gambling market vendors
VizieR Online Data Catalog: Galactic interstellar dust Gaia-2MASS 3D maps (Lallement R.+, 2019)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We downloaded the complete mitochondrial genome data of 18 Engraulidae fish species from the NCBI database (https://www.ncbi.nlm.nih.gov/). These files were stored in the “Download data” folder. Subsequently, we reannotated these mitochondrial genomes using the MITOS2 online tool available on the Galaxy website (https://usegalaxy.org/) and manually modified the original gb files to adjust the inaccurately annotated control regions and to add the annotation information for the light-strand replication origin. The revised files were saved in the “Reannotation” folder and were used for subsequent analyses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to the manuscript titled "The Massive and Distant Clusters of WISE Survey. XII. Exploring X-ray AGN in Dynamically Active Massive Galaxy Clusters at z ∼ 1," which has been submitted to The Astrophysical Journal. To reproduce the plots and access the catalogs used in the paper, please download and extract all the zip folders and the Jupyter Notebook "madcows_master_notebook.ipynb" provided under the same directory. Then, open the notebook and follow the instructions provided within. If you encounter any issues, please contact the corresponding author for assistance.
In the fourth quarter of 2024, Samsung shipped around ** million smartphones, a decrease from both the previous quarter and the same quarter of the previous year. Samsung’s sales consistently place the smartphone giant among the top three smartphone vendors in the world, alongside Xiaomi and Apple.
Samsung smartphone sales – how many phones does Samsung sell?
Global smartphone sales reached over *** billion units during 2024. While the global smartphone market is led by Samsung and Apple, Xiaomi has gained ground following the decline of Huawei. Together, these three companies hold more than ** percent of the global smartphone market share.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 100,000 observations from the Data Release (DR) 18 of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and 1 class column classifying the observation as either:
You can read more about the features below:
The run number refers to a specific period in which the SDSS observes a part of the sky. SDSS is divided into several runs, each lasting for a certain amount of time, which are then combined to cover an extensive portion of the sky. The rerun number refers to the reprocessing of the data obtained.
In each run, multiple charge-coupled device (CCD) cameras are arranged into columns, each of which is responsible for imaging a specific portion of the sky. camcol refers to the camera column number which imaged a specific observation. A field is a specific portion of the sky that is imaged during a single exposure of the telescope. The entire sky is divided into a number of fields, and the field number column refers to the field or portion of the sky from which an observation was obtained.
A number of physical glass plates are mounted on the telescope, each containing a number of optical fibers corresponding to a specific position in the sky. When light hits these optical fibers, it is sent to spectrographs for analysis. plate number and fiberID refer to the number of the plate and the ID of the optical fiber responsible for gathering light from the celestial object respectively.
Modified Julian Date represents the number of days that have passed since midnight Nov. 17, 1858. It is used in SDSS to keep track of the time of each observation.
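As a small worked example of the definition above, an MJD value can be turned into a calendar date by adding that many days to the epoch of midnight, Nov. 17, 1858. The helper names `mjd_to_datetime` and `datetime_to_mjd` are hypothetical, and the sketch treats the timestamp as UTC and ignores leap seconds.

```python
from datetime import datetime, timedelta, timezone

# MJD epoch from the definition above: midnight, Nov. 17, 1858 (taken as UTC here).
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date (in days) to a timezone-aware datetime."""
    return MJD_EPOCH + timedelta(days=mjd)


def datetime_to_mjd(moment):
    """Convert a timezone-aware datetime back to a Modified Julian Date."""
    return (moment - MJD_EPOCH) / timedelta(days=1)


start_of_2000 = mjd_to_datetime(51544.0)
print(start_of_2000.isoformat())
print(datetime_to_mjd(datetime(2000, 1, 1, 12, tzinfo=timezone.utc)))
```

Fractional MJD values carry the time of day, so 51544.5 corresponds to noon on the same date.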
The Petrosian radius is a measure of the size of a galaxy, and it is calculated using the Petrosian flux profile. The Petrosian flux profile measures how the brightness of an object varies with distance from its center. The Petrosian radius is defined as the distance from the galaxy's center where the ratio of the local surface brightness to the average surface brightness reaches a certain predefined value. The local surface brightness refers to the brightness of a specific small region, here a thin ring at that distance from the center. It is a measure of how much light is detected from that particular region. The average surface brightness, on the other hand, is the mean brightness inside that distance: the total amount of light received from within the radius divided by the enclosed area.
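To make the ratio definition concrete, the sketch below estimates a Petrosian radius from a sampled radial brightness profile. It is an illustration under stated assumptions, not the SDSS pipeline: the function name `petrosian_radius` is hypothetical, the average surface brightness is taken as the mean inside radius r, the threshold of 0.2 is an assumed value for the "predefined" ratio, and the test profile is a synthetic exponential disk with unit scale length.

```python
import numpy as np


def petrosian_radius(radius, brightness, threshold=0.2):
    """Return the first sampled radius where local / mean-enclosed brightness <= threshold.

    radius: increasing 1-D array of distances from the center.
    brightness: local surface brightness at each radius.
    Returns None if the ratio never drops to the threshold.
    """
    # Light in each thin ring is 2*pi*r*I(r) dr; integrate with the trapezoid rule.
    ring = 2.0 * np.pi * radius * brightness
    steps = 0.5 * (ring[1:] + ring[:-1]) * np.diff(radius)
    enclosed = np.concatenate(([0.0], np.cumsum(steps)))
    # Light inside the innermost sample, assuming constant brightness there.
    enclosed += np.pi * radius[0] ** 2 * brightness[0]

    mean_inside = enclosed / (np.pi * radius ** 2)
    ratio = brightness / mean_inside

    reached = np.nonzero(ratio <= threshold)[0]
    if reached.size == 0:
        return None
    return float(radius[reached[0]])


# Synthetic exponential profile I(r) = exp(-r), radii in units of the scale length.
r = np.linspace(0.01, 10.0, 2000)
profile = np.exp(-r)
r_petro = petrosian_radius(r, profile)
print(r_petro)
```

For this exponential profile the ratio falls steadily with radius, so the first crossing of the threshold is well defined; noisy real profiles would need smoothing or interpolation before applying the same test.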
These parameters help in characterizing the properties of celestial objects, especially when studying their morphologies, sizes, and how they evolve over time.
These parameters help in studying the photometric properties of the celestial objects, particularly in analyzing the brightness, colors, and spectral energy distribution of the objects. By using petrosian fluxes in different bands, astronomers can obtain a comprehensive view of an object's light emission across the electromagnetic spectrum.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about the dataset files:
1) pancan_rnaseq_freeze.tsv.gz: Publicly available gene expression data for the TCGA Pan-cancer dataset. File: PanCanAtlas EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611] [https://doi.org/10.1016/j.celrep.2018.03.046]
2) pancan_mutation_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]
3) pancan_GISTIC_threshold.tsv.gz: Publicly available gene-level copy number information of the TCGA Pan-cancer dataset. This file is processed using script process_copynumber.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. The files copy_number_loss_status.tsv.gz and copy_number_gain_status.tsv.gz generated from this data are used as inputs in our Galaxy pipeline. [https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443] [https://doi.org/10.1016/j.celrep.2018.03.046]
4) mutation_burden_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/][http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]
5) sample_freeze.tsv or sample_freeze_version4_modify.tsv: The file lists the frozen samples as determined by TCGA PanCancer Atlas consortium along with raw RNAseq and mutation data. These were previously determined and included for all downstream analysis. All other datasets were processed and subset according to the frozen samples. [https://github.com/greenelab/pancancer/]
6) vogelstein_cancergenes.tsv: compendium of OG and TSG used for the analysis. [https://github.com/greenelab/pancancer/]
7) CCLE_DepMap_18Q1_maf_20180207.txt.gz Publicly available Mutational data for CCLE cell lines from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2FCCLE_DepMap_18Q1_maf_20180207.txt]
8) ccle_rnaseq_genes_rpkm_20180929.gct.gz: Publicly available Expression data for 1019 cell lines (RPKM) from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2Fccle_2019%2FCCLE_RNAseq_genes_rpkm_20180929.gct.gz]
9) CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct: Publicly available merged Mutational and copy number alterations that include gene amplifications and deletions for the CCLE cell lines. This data is represented in the binary format and provided by the Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://data.broadinstitute.org/ccle_legacy_data/binary_calls_for_copy_number_and_mutation_data/CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct]
10) GDSC_cell_lines_EXP_CCLE_names.csv.gz Publicly available RMA normalized expression data for Genomics of Drug Sensitivity in Cancer(GDSC) cell-lines. File gdsc_cell_line_RMA_proc_basalExp.csv was downloaded. This data was subsetted to 389 cell lines that are common among CCLE and GDSC. All the GDSC cell line names were replaced with CCLE cell line names for further processing. [https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip]
11) GDSC_CCLE_common_mut_cnv_binary.csv.gz: A subset of merged Mutational and copy number alterations that include gene amplifications and deletions for common cell lines between GDSC and CCLE. This file is generated using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct and a list of common cell lines.
12) gdsc1_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC1 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_15Oct19.xlsx]
13) gdsc2_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC2 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC2_fitted_dose_response_15Oct19.xlsx]
14) compounds.csv: list of pharmacological compounds tested for our analysis
15) tcga_dictonary.tsv: list of cancer types used in the analysis.
16) seg_based_scores.tsv: Measurement of total copy number burden, Percent of genome altered by copy number alterations. This file was used as part of the Pancancer analysis by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data reproduction package for the paper "X-ray investigation of the remarkable galaxy group Nest200047" by Anwesh Majumder, M.W. Wise, A. Simionescu, M.N. de Vries (accepted in MNRAS).
Raw data: The Chandra and XMM data can be downloaded from https://cda.harvard.edu/chaser/ and http://nxsa.esac.esa.int/nxsa-web/#search using the observation IDs. See the 'Data' section of the paper to know what to download. Any additional data source has been mentioned in the paper as footnotes.
Software required:
CIAO (https://cxc.cfa.harvard.edu/ciao/)
XMM-SAS (https://www.cosmos.esa.int/web/xmm-newton/download-and-install-sas)
Jupyter Notebook and Python-3.9 or higher (https://jupyter.org)
SPEX (https://spex-xray.github.io/spex-help/index.html)
PyProffit (https://pyproffit.readthedocs.io/en/latest/index.html)
CXBups (https://zenodo.org/records/2575495)
There are README files inside directories.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for reproducing the results of the manuscript "DIMet : An open-source tool for Differential analysis of targeted Isotope-labeled Metabolomics data". DIMet tool is available here, and the tool documentation is accessible in the DIMet wiki page and in its Galaxy site.
Users of the Galaxy version of DIMet:
Users of the command-line version of DIMet:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset delivers a single, end-to-end resource for training and benchmarking facial liveness-detection systems. By aggregating live sessions and eleven realistic presentation-attack classes into one collection, it accelerates development toward iBeta Level 1/2 compliance and strengthens model robustness against the full spectrum of spoofing tactics
Modern certification pipelines demand proof that a system resists all common attack vectors, not just prints or replays. This dataset delivers those vectors in one place, allowing you to:
Benchmark a model’s true generalisation
Fine-tune against rare but high-impact threats (e.g., silicone or textile masks)
Streamline audits by demonstrating coverage of every ISO 30107-3 attack category
Ideal for companies pursuing or maintaining iBeta Level 1/2 certification, research groups exploring new PAD architectures, and vendors stress-testing production face-verification pipelines.
This dataset’s scale, breadth of attack types, and real-world capture conditions make it indispensable for anyone building or evaluating biometric anti-spoofing solutions. Deploy it to harden your systems against today’s, and tomorrow’s, most sophisticated presentation attacks.