A tracer breakthrough curve (BTC) for each sampling station is the ultimate goal of every quantitative hydrologic tracing study, and dataset size can critically affect the BTC. Groundwater-tracing data obtained using in situ automatic sampling or detection devices can produce very high-density datasets, and BTCs built from such data and stored in dataloggers can be visually cluttered with overlapping data points. The high-frequency settings available on in situ devices ensure that important BTC features, such as concentration peaks, are not missed, but the resulting dense datasets can also be difficult to interpret. More difficult still is the application of such dense datasets in solute-transport models, which may be unable to adequately reproduce tracer BTC shapes because of the overwhelming mass of data. One solution to the difficulties of analyzing, interpreting, and modeling dense datasets is the selective removal of blocks of data from the total dataset. Although blocks of BTC data can be skipped periodically (data decimation) to reduce the size and density of the dataset, skipping or deleting blocks of data may also discard the very features that the high-frequency detection settings were intended to capture. Rather than removing, reducing, or reformulating overlapping data, signal filtering and smoothing may be applied, but smoothing errors (e.g., averaging errors, outliers, and potential time shifts) need to be considered. Fitting appropriate probability distributions to tracer BTCs can describe typical BTC shapes, which usually include long tails, and recognizing which distributions apply can aid in understanding aspects of tracer migration. This dataset is associated with the following publications: Field, M. Tracer-Test Results for the Central Chemical Superfund Site, Hagerstown, Md. May 2014 -- December 2015. U.S. Environmental Protection Agency, Washington, DC, USA, 2017. Field, M. On Tracer Breakthrough Curve Dataset Size, Shape, and Statistical Distribution. Advances in Water Resources, Elsevier, New York, NY, USA, 141: 1-19, (2020).
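The decimation and smoothing trade-offs described above can be sketched on a synthetic BTC (all parameters below are hypothetical illustrations, not values from the study): periodic decimation keeps every k-th sample and risks thinning the peak, while a moving average keeps every timestamp but flattens the peak slightly, one of the smoothing errors the abstract mentions.

```python
import numpy as np

# Synthetic breakthrough curve: a lognormal-shaped pulse with a long tail,
# sampled at high frequency (hypothetical parameters for illustration only).
t = np.linspace(0.1, 100.0, 10_000)          # time since injection
btc = np.exp(-((np.log(t) - np.log(20.0)) ** 2) / 0.5) / t

# Periodic decimation: keep every k-th sample. Cheap, but a narrow peak
# can fall between retained samples.
k = 50
t_dec, btc_dec = t[::k], btc[::k]

# Moving-average smoothing: keeps all timestamps but averages the signal;
# note it also flattens the peak slightly (a smoothing error).
w = 51
kernel = np.ones(w) / w
btc_smooth = np.convolve(btc, kernel, mode="same")

# Neither reduced series can exceed the original peak concentration.
print(len(t_dec), btc_dec.max() <= btc.max(), btc_smooth.max() <= btc.max())
```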
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Lower Frederick Township, Pennsylvania, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
[Figure: Lower Frederick Township, Pennsylvania median household income, by household size (in 2022 inflation-adjusted dollars)]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
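For working with the margin of error noted above: ACS margins of error are published at the 90% confidence level (so the standard error is MOE / 1.645), and the Census Bureau's ACS guidance gives a root-sum-of-squares approximation for the MOE of a sum of estimates. A minimal sketch; the example MOE values are hypothetical:

```python
import math

def moe_of_sum(moes):
    """Approximate margin of error for a sum of ACS estimates,
    using the root-sum-of-squares rule from Census Bureau ACS guidance."""
    return math.sqrt(sum(m * m for m in moes))

def moe_to_se(moe, z=1.645):
    """Convert a published ACS margin of error (90% confidence level)
    to a standard error."""
    return moe / z

# Hypothetical MOEs for two household-size groups being combined.
combined_moe = moe_of_sum([2500.0, 1800.0])
se = moe_to_se(2500.0)
print(round(combined_moe, 1), round(se, 1))
```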
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lower Frederick township median household income. You can refer to the main dataset here.
The study investigated whether one week of exposure to images of bodies of different weights affects the way healthy adult women perceive their own body size.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current understanding of animal population responses to rising temperatures is based on the assumption that biological rates such as metabolism, which governs fundamental ecological processes, scale independently with body size and temperature, despite empirical evidence for interactive effects. Here we investigate the consequences of interactive temperature- and size-scaling of vital rates for the dynamics of populations experiencing warming using a stage-structured consumer-resource model. We show that interactive scaling alters population and stage-specific responses to rising temperatures, such that warming can induce shifts in population regulation and stage-structure, influence community structure and govern population responses to mortality. Analyzing experimental data for 20 fish species, we found size-temperature interactions in intraspecific scaling of metabolic rate to be common. Given the evidence for size-temperature interactions and the ubiquity of size structure in animal populations, we argue that accounting for size-specific temperature effects is pivotal for understanding how warming affects animal populations and communities.
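The interactive scaling at issue can be illustrated with a toy metabolic-rate model: a Boltzmann-Arrhenius temperature term combined with a mass exponent that is allowed to shift with temperature. All parameter values and the functional form of the interaction are illustrative assumptions, not estimates from the study:

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def metabolic_rate(mass_g, temp_k, b0=1.0, a0=0.75, a_slope=0.0, e_act=0.63):
    """Metabolic rate with an optional size-temperature interaction:
    the mass exponent a(T) = a0 + a_slope * (T - 293.15) shifts with
    temperature (a_slope = 0 recovers independent scaling).
    Parameter values are illustrative, not from the study."""
    a = a0 + a_slope * (temp_k - 293.15)
    return b0 * mass_g ** a * math.exp(-e_act / (K_B * temp_k))

# Independent scaling: 5 K of warming raises rates by the same factor
# for a 1 g and a 100 g animal.
r_small = metabolic_rate(1.0, 298.15) / metabolic_rate(1.0, 293.15)
r_large = metabolic_rate(100.0, 298.15) / metabolic_rate(100.0, 293.15)
print(abs(r_small - r_large) < 1e-9)  # same warming response at all sizes

# Interactive scaling: the warming response now depends on body size.
ri_small = metabolic_rate(1.0, 298.15, a_slope=-0.002) / metabolic_rate(1.0, 293.15, a_slope=-0.002)
ri_large = metabolic_rate(100.0, 298.15, a_slope=-0.002) / metabolic_rate(100.0, 293.15, a_slope=-0.002)
print(ri_small > ri_large)  # larger individuals respond less strongly
```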
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Olympus VSI "multifile" test dataset
A publicly available dataset in the proprietary Olympus VSI format intended for testing file readers and other related mechanisms.
Key properties / requirements:
A .vsi file plus a subfolder structure containing one or more .ets files.
IMPORTANT: this dataset was heavily postprocessed (see the section below); it is purely meant to provide a valid example of the file format structure.
Dataset Information
Image Dimensions
Software Version
The software used to acquire and postprocess the dataset.
Postprocessing
The following steps were applied to reduce the size of the dataset:
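The required layout described above (a .vsi file plus a subfolder structure containing one or more .ets files) can be checked programmatically before attempting to read the data. A minimal sketch with hypothetical file names and a purely structural heuristic: the real format ties the .vsi to a specific sibling folder, while this sketch only checks that some sibling directory holds .ets files.

```python
import tempfile
from pathlib import Path

def looks_like_vsi_dataset(vsi_path: Path) -> bool:
    """Heuristic structural check based on the description above:
    a .vsi file accompanied by a sibling directory holding .ets files.
    This validates layout only, not file contents."""
    if vsi_path.suffix.lower() != ".vsi" or not vsi_path.is_file():
        return False
    for sibling in vsi_path.parent.iterdir():
        if sibling.is_dir() and any(sibling.rglob("*.ets")):
            return True
    return False

# Demonstrate on a synthetic layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sample.vsi").write_bytes(b"")
    ets_dir = root / "_sample_" / "stack1"
    ets_dir.mkdir(parents=True)
    (ets_dir / "frame_t.ets").write_bytes(b"")
    ok = looks_like_vsi_dataset(root / "sample.vsi")
print(ok)
```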
https://www.technavio.com/content/privacy-notice
AI Training Dataset Market Size 2025-2029
The AI training dataset market size is valued to increase by USD 7.33 billion, at a CAGR of 29% from 2024 to 2029. The proliferation and increasing complexity of foundational AI models will drive market growth.
Market Insights
North America dominated the market, accounting for 36% of growth during 2025-2029.
By Service Type - Text segment was valued at USD 742.60 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 479.81 million
Market Future Opportunities 2024: USD 7334.90 million
CAGR from 2024 to 2029: 29%
Market Summary
The market is experiencing significant growth as businesses increasingly rely on artificial intelligence (AI) to optimize operations, enhance customer experiences, and drive innovation. The proliferation and increasing complexity of foundational AI models necessitate large, high-quality datasets for effective training and improvement. This shift from data quantity to data quality and curation is a key trend in the market. Navigating data privacy, security, and copyright complexities, however, poses a significant challenge. Businesses must ensure that their datasets are ethically sourced, anonymized, and securely stored to mitigate risks and maintain compliance. For instance, in the supply chain optimization sector, companies use AI models to predict demand, optimize inventory levels, and improve logistics. Access to accurate and up-to-date training datasets is essential for these applications to function efficiently and effectively. Despite these challenges, the benefits of AI and the need for high-quality training datasets continue to drive market growth. The potential applications of AI are vast and varied, from healthcare and finance to manufacturing and transportation. As businesses continue to explore the possibilities of AI, the demand for curated, reliable, and secure training datasets will only increase.
What will be the size of the AI Training Dataset Market during the forecast period?
The market continues to evolve, with businesses increasingly recognizing the importance of high-quality datasets for developing and refining artificial intelligence models. According to recent studies, the use of AI in various industries is projected to grow by over 40% in the next five years, creating a significant demand for training datasets. This trend is particularly relevant for boardrooms, as companies grapple with compliance requirements, budgeting decisions, and product strategy. Moreover, the importance of data labeling, feature selection, and imbalanced data handling in model performance cannot be overstated. For instance, a mislabeled dataset can lead to biased and inaccurate models, potentially resulting in costly errors. Similarly, effective feature selection algorithms can significantly improve model accuracy and reduce computational resources. Despite these challenges, advances in model compression methods, dataset scalability, and data lineage tracking are helping to address some of the most pressing issues in the market. For example, model compression techniques can reduce the size of models, making them more efficient and easier to deploy. Similarly, data lineage tracking can help ensure data consistency and improve model interpretability. In conclusion, the market is a critical component of the broader AI ecosystem, with significant implications for businesses across industries. By focusing on data quality, effective labeling, and advanced techniques for handling imbalanced data and improving model performance, organizations can stay ahead of the curve and unlock the full potential of AI.
Unpacking the AI Training Dataset Market Landscape
In the realm of artificial intelligence (AI), the significance of high-quality training datasets is indisputable. Businesses harnessing AI technologies invest substantially in acquiring and managing these datasets to ensure model robustness and accuracy. According to recent studies, up to 80% of machine learning projects fail due to insufficient or poor-quality data. Conversely, organizations that effectively manage their training data experience an average ROI improvement of 15% through cost reduction and enhanced model performance.
Distributed computing systems and high-performance computing facilitate the processing of vast datasets, enabling businesses to train models at scale. Data security protocols and privacy preservation techniques are crucial to protect sensitive information within these datasets. Reinforcement learning models and supervised learning models each have their unique applications, with the former demonstrating a 30% faster convergence rate in certain use cases.
Data annot
Since the 1940s, hydrologists have used aquifer tests to estimate the hydrogeologic properties near test wells. Results from these tests are recorded in various files, databases, reports and scientific publications. The U.S. Geological Survey (USGS), Lower Mississippi-Gulf Water Science Center (LMG) is aggregating all aquifer test results from Alabama, Arkansas, Louisiana, Mississippi and Tennessee into a single dataset that is publicly available in a machine-readable format. This dataset contains information and results from 2,245 aquifer tests compiled in the LMG-Hydrogeologic Aquifer Test Dataset - December 2020. Descriptive statistics for the December 2020 dataset are presented in Table 1 (below) and in the Summary_Readme.pdf. Additionally, this dataset contains 6 attribute tables (.txt files) with additional information for various fields, a zip file containing the geospatial data, and the companion attribute table as a .txt file. THE LMG-HYDROGEOLOGIC AQUIFER TEST DATASET – DECEMBER 2020 IS AVAILABLE IN TWO FORMATS: 1) a tab-delimited text (.txt) UTF-8 file and 2) an ESRI GIS point shapefile. FIELDS INCLUDED IN THE LMG-HYDROGEOLOGIC AQUIFER TEST DATASET – DECEMBER 2020: [a complete list of field names, their definitions and units is provided in the Summary_Readme.pdf file] Location data: USGS site identification number, local identification name, Public Land Survey System number, latitude, longitude, State and county. Well construction data: construction date, well depth, diameter of well, diameter of casing, depth to top of opening (screen) interval, depth to bottom of opening interval and length of the open interval. Aquifer data: local aquifer name and code, national aquifer name and code, top of aquifer (altitude), bottom of aquifer, and thickness of aquifer.
Groundwater test data: test date, yield/discharge, length of time associated with yield, static water level in feet below land surface, production water level in feet below land surface associated with yield, drawdown associated with yield. Hydrogeologic data: specific capacity, transmissivity, horizontal conductivity, vertical conductivity, permeability and storage coefficient. Ancillary data: method of test analysis and data source reference. DESCRIPTIONS OF ATTACHED FILES: Summary_Readme.pdf: a Portable Document Format (PDF) file with field names, definitions and units for the aquifer test dataset and the associated attribute tables. This file also contains summary statistics for aquifer tests compiled through December 2020. LMG-HydrogeologicAqfrTestDataset_Dec2020.txt: a tab-delimited, UTF-8 text file of the attribute table associated with the LMG-HydrogeologicTestData_Dec2020 geospatial dataset. AtbtTbl_AqfrCd_Readme.txt: a UTF-8 text file containing information from the National Water Information System: Help System web page about USGS groundwater codes (accessed December 4, 2019 at https://help.waterdata.usgs.gov/codes-and-parameters). AtbtTbl_FipsGeographyCodes.txt: a tab-delimited, UTF-8 text file of FIPS (Federal Information Processing Standards) codes, uniquely identifying States, counties and county equivalents in the United States. Note: to reduce the size of this file, city codes were removed (accessed January 8, 2020 at https://www.census.gov/geographies/reference-files/2017/demo/popest/2017-fips.html). AtbtTbl_LocalAqfrCodes.txt: a tab-delimited, UTF-8 text file of eight-character strings identifying local aquifers. Codes are defined by the "Catalog of Aquifer Names and Geologic Unit Codes" used by the USGS.
(accessed December 4, 2019 at https://help.waterdata.usgs.gov/aqfr_cd) AtbtTbl_NatAqfrCodes.txt: a tab-delimited, UTF-8 text file of ten-character strings identifying a National aquifer, or principal aquifer of the United States, defined as regionally extensive aquifers or aquifer systems that have the potential to be used as a source of potable water. (accessed December 4, 2019 at https://water.usgs.gov/ogw/NatlAqCode-reflist.html) AtbtTbl_TstMthdCodes.txt: a tab-delimited, UTF-8 text file of codes identifying the aquifer test analysis method when reported in the associated reference. AtbtTbl_DataRefNo.txt: a tab-delimited, UTF-8 text file of references for the source of the associated aquifer test result. CAVEAT: Some hydrogeologic test results reported in this dataset have not been through the USGS data review and approval process to receive the Director’s approval. Any such data are considered PROVISIONAL and subject to revision. PROVISIONAL data are released on the condition that neither the USGS nor the United States Government may be held liable for any damages resulting from its use. NOTE: If you have data you would like added to this dataset or have found an error, please contact the USGS so we may incorporate them into the next version of the LMG-Hydrogeologic Aquifer Test dataset. Table 1. Summary-descriptive statistics for the LMG-Hydrogeologic Aquifer Test Dataset – December 2020. [USGS, U.S.
Geological Survey; NWIS, National Water Information System; n, number of wells; std dev, standard deviation; aquifers listed by USGS-NWIS national standard aquifer name and code]

Specific capacity (gallons per minute per foot)
All well data: n=1733, max=15000, min=0.0025, mean=84, median=8.7, std dev=552
Alluvial aquifers (N100ALLUVL): n=21, max=723, min=0.98, mean=57, median=12, std dev=161
Mississippi River Valley alluvial aquifer (N100MSRVVL): n=185, max=10000, min=0.06, mean=265, median=72, std dev=864
Other aquifers (N9999OTHER): n=3, max=50, min=1.20, mean=18, median=2.1, std dev=28
Coastal lowlands aquifer system (S100CSLLWD): n=913, max=15000, min=0.05, mean=93, median=12, std dev=645
Mississippi embayment aquifer system (S100MSEMBM): n=429, max=641, min=0.01, mean=13, median=4, std dev=44
Southeastern Coastal Plain aquifer system (S100SECSLP): n=99, max=71, min=0.10, mean=6.2, median=3.7, std dev=8.7
Ozark Plateaus aquifer system (S400OZRKPL): n=30, max=16, min=0.16, mean=3.6, median=1.7, std dev=4.2
Edwards-Trinity aquifer system (S500EDRTRN): n=0
Unknown National aquifer: n=53, max=972, min=0.0025, mean=59, median=10, std dev=151

Transmissivity (square feet per day)
All well data: n=1549, max=260678, min=1.3, mean=12366, median=5080, std dev=20711
Alluvial aquifers (N100ALLUVL): n=26, max=41700, min=450, mean=9294, median=8422, std dev=8420
Mississippi River Valley alluvial aquifer (N100MSRVVL): n=146, max=171800, min=236, mean=31934, median=24431, std dev=28074
Other aquifers (N9999OTHER): n=4, max=26000, min=24, mean=8506, median=4000, std dev=11822
Coastal lowlands aquifer system (S100CSLLWD): n=703, max=260678, min=1.5, mean=15585, median=8000, std dev=23875
Mississippi embayment aquifer system (S100MSEMBM): n=456, max=36000, min=1.3, mean=4618, median=2406, std dev=6006
Southeastern Coastal Plain aquifer system (S100SECSLP): n=114, max=80000, min=5.00, mean=3652, median=1340, std dev=8838
Ozark Plateaus aquifer system (S400OZRKPL): n=36, max=4983, min=42, mean=1056, median=534, std dev=1262
Edwards-Trinity aquifer system (S500EDRTRN): n=1, max=161, min=161, mean=161, median=161
Unknown National aquifer: n=63, max=84486, min=5.9, mean=11103, median=4345, std dev=16908

Horizontal hydraulic conductivity (feet per day)
All well data: n=749, max=1077, min=0.01, mean=72, median=50, std dev=82
Alluvial aquifers (N100ALLUVL): n=6, max=321, min=39.88, mean=160, median=176, std dev=106
Mississippi River Valley alluvial aquifer (N100MSRVVL): n=46, max=400, min=6.88, mean=182, median=190, std dev=134
Other aquifers (N9999OTHER): n=4, max=269, min=92.00, mean=183, median=185, std dev=95
Coastal lowlands aquifer system (S100CSLLWD): n=268, max=1077, min=1.00, mean=93, median=81, std dev=85
Mississippi embayment aquifer system (S100MSEMBM): n=271, max=370, min=0.02, mean=54, median=43, std dev=52
Southeastern Coastal Plain aquifer system (S100SECSLP): n=109, max=230, min=0.30, mean=31, median=14, std dev=36
Ozark Plateaus aquifer system (S400OZRKPL): n=33, max=1.9, min=0.01, mean=0.54, median=0.31, std dev=0.58
Edwards-Trinity aquifer system (S500EDRTRN): n=0
Unknown National aquifer: n=12, max=267, min=16.00, mean=104, median=54, std dev=99

Permeability (gallons per day per square foot)
All well data: n=497, max=8375, min=0.12, mean=736, median=400, std dev=947
Alluvial aquifers (N100ALLUVL): n=12, max=2400, min=328, mean=1307, median=1270, std dev=602
Mississippi River Valley alluvial aquifer (N100MSRVVL): n=43, max=7891, min=110, mean=1926, median=1785, std dev=1174
Other aquifers (N9999OTHER): n=0
Coastal lowlands aquifer system (S100CSLLWD): n=263, max=8375, min=11, mean=796, median=636, std dev=973
Mississippi embayment aquifer system (S100MSEMBM): n=165, max=1300, min=0.12, mean=235, median=177, std dev=237
Southeastern Coastal Plain aquifer system (S100SECSLP): n=0
Ozark Plateaus aquifer system (S400OZRKPL): n=0
Edwards-Trinity aquifer system (S500EDRTRN): n=0
Unknown National aquifer: n=14, max=4158, min=201, mean=1390, median=1204, std dev=963

Storage coefficient (dimensionless)
All well data: n=490, max=1.62, min=6.30E-10, mean=0.0083, median=0.00051, std dev=0.081
Alluvial aquifers (N100ALLUVL): n=21, max=0.08, min=0.0002, mean=0.0053, median=0.00054, std dev=0.017
Mississippi River Valley alluvial aquifer (N100MSRVVL): n=82, max=0.09, min=0.0001, mean=0.0081, median=0.0013, std dev=0.016
Other aquifers (N9999OTHER): n=1, max=0.0006, min=0.0006, mean=0.0006, median=0.0006
Coastal lowlands aquifer system (S100CSLLWD): n=233, max=0.72, min=6.30E-10, mean=0.0054, median=0.0005, std dev=0.048
Mississippi embayment aquifer system (S100MSEMBM): n=100, max=1.62, min=0.000012, mean=0.0180, median=0.00027, std dev=0.16
Southeastern Coastal Plain aquifer system (S100SECSLP): n=16, max=0.006, min=0.00003, mean=0.0005, median=0.0002, std dev=0.0015
Ozark Plateaus aquifer system (S400OZRKPL): n=0
Edwards-Trinity aquifer system (S500EDRTRN): n=0
Unknown National aquifer: n=37, max=0.05, min=0.000078, mean=0.0062, median=0.00067, std dev=0.014

This dataset was developed as part of the U.S. Geological Survey, Mississippi Alluvial Plain Regional Water-Availability Study.
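Because the dataset is distributed as a tab-delimited UTF-8 text file, it can be read with standard tooling. A minimal sketch using Python's csv module; the column names and rows below are illustrative stand-ins, not the dataset's actual fields (the authoritative field list is in Summary_Readme.pdf):

```python
import csv
import io

# A two-row stand-in for LMG-HydrogeologicAqfrTestDataset_Dec2020.txt.
# For the real file, replace io.StringIO(sample) with
# open("LMG-HydrogeologicAqfrTestDataset_Dec2020.txt", encoding="utf-8").
sample = (
    "site_id\tnatl_aqfr_cd\ttransmissivity_ft2d\tstorage_coef\n"
    "070000001\tS100CSLLWD\t8000\t0.0005\n"
    "070000002\tS100MSEMBM\t2406\t0.00027\n"
)

# csv.DictReader handles the tab-delimited layout directly.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Example aggregation: mean transmissivity of the parsed records.
mean_t = sum(float(r["transmissivity_ft2d"]) for r in rows) / len(rows)
print(len(rows), mean_t)
```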
(1) This dataset was simulated by a high-resolution atmospheric model with a horizontal resolution of 60 km over the globe (GCM) and 20 km over Japan and its surroundings (RCM). The climate of the latter half of the 20th century is simulated for 6,000 years (3,000 years for the Japan area), and climates 1.5 K (*2), 2 K (*1) and 4 K warmer than the pre-industrial climate are simulated for 1,566, 3,240 and 5,400 years, respectively, to assess the effect of global warming. (2) The huge number of ensemble members enables statistically robust, high-accuracy estimates of future changes in extreme events such as typhoons and localized torrential downpours. In addition, this dataset provides highly reliable information on the impact of climate-change-driven natural disasters on future societies. (3) This dataset provides the climate projections on which global-warming adaptations are based in various fields, for example, disaster prevention, urban planning and environmental protection. It should enable adaptations that are consistent not only among issues but also among regions. (4) The total size of this dataset is 3 PB (3 × 10^15 bytes).
(*1) The dataset of climates 2 K warmer than the pre-industrial climate (d4PDF 2K) has been available since 10 August 2018. (*2) The dataset of climates 1.5 K warmer than the pre-industrial climate (d4PDF 1.5K) has been available since 8 February 2022.
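The value of very large ensembles for extreme-event statistics, as described in point (2), can be illustrated with empirical exceedance probabilities: with thousands of simulated years, the return period of a rare event can be estimated directly by counting exceedances, whereas a single short record often contains none at all. A sketch with synthetic Gumbel-distributed annual maxima (toy data, not actual d4PDF output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for ensemble output: annual-maximum rainfall from
# 3000 simulated years vs. a single 50-year observational-length record.
ensemble = rng.gumbel(loc=100.0, scale=25.0, size=3000)
single = rng.gumbel(loc=100.0, scale=25.0, size=50)

def empirical_return_period(samples, threshold):
    """Return period (years) = 1 / empirical exceedance probability."""
    p_exceed = np.mean(samples > threshold)
    return float("inf") if p_exceed == 0 else 1.0 / p_exceed

threshold = 200.0  # a rare event for this toy distribution
rp_ensemble = empirical_return_period(ensemble, threshold)
rp_single = empirical_return_period(single, threshold)
print(rp_ensemble, rp_single)  # the short record may yield no exceedances
```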
Recombinant adeno-associated virus (rAAV) vectors mediate long-term gene transfer without any known toxicity. The primary limitation of rAAV has been the small size of the virion (20 nm), which only permits the packaging of 4.7 kilobases (kb) of exogenous DNA, including the promoter, the polyadenylation signal and any other enhancer elements that might be desired. Two recent reports (D Duan et al: Nat Med 2000, 6:595-598; Z Yan et al: Proc Natl Acad Sci USA 2000, 97:6716-6721) have exploited a unique feature of rAAV genomes, their ability to link together in doublets or strings, to bypass this size limitation. This technology could improve the chances for successful gene therapy of diseases like cystic fibrosis or Duchenne muscular dystrophy that lead to significant pulmonary morbidity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Deep learning (DL) techniques have seen tremendous interest in medical imaging, particularly in the use of convolutional neural networks (CNNs) for the development of automated diagnostic tools. The facility of its non-invasive acquisition makes retinal fundus imaging particularly amenable to such automated approaches. Recent work in the analysis of fundus images using CNNs relies on access to massive datasets for training and validation, composed of hundreds of thousands of images. However, data residency and data privacy restrictions stymie the applicability of this approach in medical settings where patient confidentiality is a mandate. Here, we showcase results for the performance of DL on small datasets to classify patient sex from fundus images—a trait thought not to be present or quantifiable in fundus images until recently. Specifically, we fine-tune a Resnet-152 model whose last layer has been modified to a fully-connected layer for binary classification. We carried out several experiments to assess performance in the small dataset context using one private (DOVS) and one public (ODIR) data source. Our models, developed using approximately 2500 fundus images, achieved test AUC scores of up to 0.72 (95% CI: [0.67, 0.77]). This corresponds to a mere 25% decrease in performance despite a nearly 1000-fold decrease in the dataset size compared to prior results in the literature. Our results show that binary classification, even with a hard task such as sex categorization from retinal fundus images, is possible with very small datasets. Our domain adaptation results show that models trained with one distribution of images may generalize well to an independent external source, as in the case of models trained on DOVS and tested on ODIR. Our results also show that eliminating poor quality images may hamper training of the CNN due to reducing the already small dataset size even further. 
Nevertheless, using high quality images may be an important factor as evidenced by superior generalizability of results in the domain adaptation experiments. Finally, our work shows that ensembling is an important tool in maximizing performance of deep CNNs in the context of small development datasets.
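The reported AUC of 0.72 comes with a 95% confidence interval; on small test sets such intervals are commonly obtained by bootstrap resampling. A minimal sketch of that general procedure on synthetic labels and scores (the rank-sum AUC estimator and the resampling scheme are standard techniques, not the paper's actual code or data):

```python
import numpy as np

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(42)
n = 500  # roughly small-dataset scale
labels = rng.integers(0, 2, size=n)
# Synthetic classifier scores that are mildly informative.
scores = labels * 0.5 + rng.normal(0.0, 1.0, size=n)

# Bootstrap: resample cases with replacement, recompute AUC each time.
boots = [
    auc(labels[idx], scores[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(1000))
]
lo, hi = np.percentile(boots, [2.5, 97.5])
point = auc(labels, scores)
print(round(point, 3), (round(lo, 3), round(hi, 3)))
```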
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Lower Heidelberg Township, Pennsylvania, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
[Figure: Lower Heidelberg Township, Pennsylvania median household income, by household size (in 2022 inflation-adjusted dollars)]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lower Heidelberg township median household income. You can refer to the main dataset here.
https://doi.org/10.5061/dryad.wwpzgmst6
The repository includes the following files:
AlexanderBodySize_all.csv: grasshopper body size dataset for the Gordon Alexander collection at the University of Colorado Museum of Natural History.
AlexanderBodySize_wClimate.csv: abbreviated grasshopper body size dataset with appended climate data.
HopperData_Sept2019.csv: data from grasshopper phenological surveys from Buckley et al (2021).
Levy_FemaleGradientDataGrasshopper.csv: reproductive data from Levy and Nufio (2014).
NiwotClimateFilled.csv: climate data for study sites.
Data are described in the AlexanderBodysizeData_Readme.csv file and below.
AlexanderBodySize_all.csv
| attributeName | attributeLabel | attributeDefinition | storageType | formatString | unit | mi...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Modality A: Near-Infrared (NIR)
Modality B: three colour channels (in B-G-R order)
Modality A: Fluorescence Images
Modality B: Quantitative Phase Images (QPI)
Modality A: Second Harmonic Generation (SHG)
Modality B: Bright-Field (BF)
The evaluation set, created from the above three publicly available datasets, consists of images that have undergone 4 levels of (rigid) transformations of increasing displacement. The level of transformation is determined by the size of the rotation angle θ and the displacements tx & ty, detailed in this table. Each image sample is transformed exactly once at each transformation level so that all levels have the same number of samples.
In total, it contains 864 image pairs created from the aerial dataset, 5040 image pairs created from the cytological dataset, and 536 image pairs created from the histological dataset. Each image pair consists of a reference patch \(I^{\text{Ref}}\) and its corresponding initial transformed patch \(I^{\text{Init}}\) in both modalities, along with the ground-truth transformation parameters to recover it.
Scripts to calculate the registration performance and to plot the overall results can be found in https://github.com/MIDA-group/MultiRegEval, and instructions to generate more evaluation data with different settings can be found in https://github.com/MIDA-group/MultiRegEval/tree/master/Datasets#instructions-for-customising-evaluation-data.
Metadata
In the *.zip files, each row in {Zurich,Balvan}_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv or Eliceiri_patches/patch_tlevel[1-4]/info_test.csv provides the information of an image pair as follows:
Filename: identifier(ID) of the image pair
X1_Ref: x-coordinate of the upper-left corner of reference patch IRef
Y1_Ref: y-coordinate of the upper-left corner of reference patch IRef
X2_Ref: x-coordinate of the lower-left corner of reference patch IRef
Y2_Ref: y-coordinate of the lower-left corner of reference patch IRef
X3_Ref: x-coordinate of the lower-right corner of reference patch IRef
Y3_Ref: y-coordinate of the lower-right corner of reference patch IRef
X4_Ref: x-coordinate of the upper-right corner of reference patch IRef
Y4_Ref: y-coordinate of the upper-right corner of reference patch IRef
X1_Trans: x-coordinate of the upper-left corner of transformed patch IInit
Y1_Trans: y-coordinate of the upper-left corner of transformed patch IInit
X2_Trans: x-coordinate of the lower-left corner of transformed patch IInit
Y2_Trans: y-coordinate of the lower-left corner of transformed patch IInit
X3_Trans: x-coordinate of the lower-right corner of transformed patch IInit
Y3_Trans: y-coordinate of the lower-right corner of transformed patch IInit
X4_Trans: x-coordinate of the upper-right corner of transformed patch IInit
Y4_Trans: y-coordinate of the upper-right corner of transformed patch IInit
Displacement: mean Euclidean distance between reference corner points and transformed corner points
RelativeDisplacement: the ratio of displacement to the width/height of image patch
Tx: randomly generated translation in the x-direction to synthesise the transformed patch IInit
Ty: randomly generated translation in the y-direction to synthesise the transformed patch IInit
AngleDegree: randomly generated rotation in degrees to synthesise the transformed patch IInit
AngleRad: randomly generated rotation in radian to synthesise the transformed patch IInit
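The relation between these columns can be sketched directly: apply the rotation AngleRad and the translation (Tx, Ty) to the four reference corners, then Displacement is the mean corner-to-corner Euclidean distance and RelativeDisplacement its ratio to the patch width. In this sketch, rotation about the patch centre and the patch size are assumptions for illustration; the dataset's generation scripts (linked above) define the actual convention.

```python
import numpy as np

def transform_corners(corners, tx, ty, theta_rad, centre):
    """Rigidly transform corner points: rotate by theta about `centre`
    (an assumed convention for this sketch), then translate by (tx, ty)."""
    c, s = np.cos(theta_rad), np.sin(theta_rad)
    rot = np.array([[c, -s], [s, c]])
    return (corners - centre) @ rot.T + centre + np.array([tx, ty])

# Four corners of a hypothetical square patch and a modest rigid motion.
size = 834.0  # assumed patch side length, for illustration only
ref = np.array([[0, 0], [0, size - 1], [size - 1, size - 1], [size - 1, 0]])
centre = ref.mean(axis=0)
init = transform_corners(ref, tx=10.0, ty=-5.0,
                         theta_rad=np.deg2rad(2.0), centre=centre)

# "Displacement": mean Euclidean distance between corresponding corners;
# "RelativeDisplacement": ratio to the patch width/height.
displacement = float(np.linalg.norm(init - ref, axis=1).mean())
relative = displacement / size
print(round(displacement, 2), round(relative, 4))
```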
Naming convention
zh{ID}_{iRow}_{iCol}_{ReferenceOrTransformed}.png, e.g. zh5_03_02_R.png indicates the Reference patch of the 3rd row and 2nd column cut from the image with ID zh5.
{{cellline}_{treatment}_{fieldofview}_{iFrame}}_{iRow}_{iCol}_{ReferenceOrTransformed}.png, e.g. PNT1A_do_1_f15_02_01_T.png indicates the Transformed patch of the 2nd row and 1st column cut from the image with ID PNT1A_do_1_f15.
{ID}_{ReferenceOrTransformed}.tif, e.g. 1B_A4_T.tif indicates the Transformed patch cut from the image with ID 1B_A4.
This dataset was originally produced by the authors of Is Image-to-Image Translation the Panacea for Multimodal Image Registration? A Comparative Study.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Lower Pottsgrove Township, Pennsylvania, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
Figure: Lower Pottsgrove Township, Pennsylvania median household income, by household size (in 2022 inflation-adjusted dollars)
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lower Pottsgrove township median household income.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Reservoir and Lake Surface Area Timeseries (ReaLSAT) dataset provides an unprecedented reconstruction of surface area variations of lakes and reservoirs at a global scale using Earth Observation (EO) data and novel machine learning techniques. The dataset provides monthly scale surface area variations (1984 to 2020) of 681,137 water bodies below 50°N with sizes greater than 0.1 square kilometers.
The dataset contains the following files:
1) ReaLSAT.zip: A shapefile that contains the reference shape of waterbodies in the dataset.
2) monthly_timeseries.zip: contains one CSV file for each water body. The CSV file provides monthly surface area variation values. The CSV files are stored in a subfolder corresponding to each 10 degree by 10 degree cell. For example, the monthly_timeseries_60_-50 folder contains CSV files of lakes that lie between 60 E and 70 E longitude and between 50 S and 40 S latitude.
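Assuming the subfolders are keyed by the lower-left (southwest) corner of each 10-degree cell, as the monthly_timeseries_60_-50 example suggests, a small helper can map a water body's coordinates to its folder name; the function name is hypothetical:

```python
import math

def cell_folder(lon, lat):
    """Return the name of the 10x10-degree folder expected to hold the
    CSV for a water body at (lon, lat), assuming folders are named after
    the cell's lower-left (southwest) corner."""
    lon0 = int(math.floor(lon / 10.0) * 10)
    lat0 = int(math.floor(lat / 10.0) * 10)
    return f"monthly_timeseries_{lon0}_{lat0}"
```

A lake at 65 E, 45 S would then fall in monthly_timeseries_60_-50, matching the example above.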
3) monthly_shapes_.zip: contains a GeoTIFF for each water body that lies within the 10 degree by 10 degree cell. Please refer to the visualization notebook for how to use these GeoTIFFs.
4) evaluation_data.zip: contains the random subsets of the dataset used for evaluation. The zip file contains a README file that describes the evaluation data.
6) generate_realsat_timeseries.ipynb: a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any waterbody.
Please refer to the following papers to learn more about the processing pipeline used to create the ReaLSAT dataset:
[1] Khandelwal, Ankush, Anuj Karpatne, Praveen Ravirathinam, Rahul Ghosh, Zhihao Wei, Hilary A. Dugan, Paul C. Hanson, and Vipin Kumar. "ReaLSAT, a global dataset of reservoir and lake surface area variations." Scientific data 9, no. 1 (2022): 1-12.
[2] Khandelwal, Ankush. "ORBIT (Ordering Based Information Transfer): A Physics Guided Machine Learning Framework to Monitor the Dynamics of Water Bodies at a Global Scale." (2019).
Version Updates
Version 2.0:
extends the dataset to 2020.
provides geotiffs instead of shapefiles for individual lakes to reduce dataset size.
provides a notebook to visualize the updated dataset.
Version 1.4: added 1120 large lakes to the dataset and removed partial lakes that overlapped with these large lakes.
Version 1.3: fixed a visualization-related bug in generate_realsat_timeseries.ipynb
Version 1.2: added a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any waterbody in the ReaLSAT database.
Since the 1940s, commercial, academic and government hydrologists have used aquifer tests to estimate the hydrogeologic properties of an aquifer near test wells. Results from these tests are recorded in various files, databases, reports, and scientific publications. The Lower Mississippi-Gulf (LMG)-Hydrogeologic Test dataset is an attempt to aggregate these dispersed hydrogeologic test results into a single dataset that is publicly available in a machine-readable format. The hydrogeologic values presented in the Mar2022 version of the LMG-Hydrogeologic Test Dataset were estimated by Douglas Carlson, PhD, with the Louisiana Geological Survey and Associate Professor-Research at Louisiana State University. Hydraulic conductivity estimates were made from specific capacity data using a technique developed by Bradbury and Rothschild (1985). Specific capacity values, from well pumping tests, were obtained from the Louisiana Water Well Registration Database. This Child Item contains the Mar2022 version of the LMG-Hydrogeologic Test dataset with information and results from 7527 aquifer tests. Additionally, this dataset contains 6 attribute tables (.txt files) with additional information for various fields, a zip file containing the geospatial data, a companion attribute table as a .txt file and a readme text file with definitions and descriptions of the attributes and attribute tables. The LMG-Hydrogeologic Aquifer Test dataset - Mar2022 is available in 2 formats: 1) a tab delimited text (.txt) UTF-8 file and 2) an ESRI GIS point shapefile. FIELDS INCLUDED IN THE LMG-HYDROGEOLOGIC TEST DATASET – Mar2022: [a complete list of field names, their definitions and units are listed in the Readme.txt file] Location Data: USGS site identification number, Local identification name, Public Land Survey System Number, Latitude, Longitude, State and County.
Well Construction Data: Construction date, well depth, Diameter of well, Diameter of casing, Depth to top of opening (screen) interval, Depth to bottom of opening interval and Length of opening interval. Aquifer Data: Local aquifer name and code, National aquifer name and code, Top of aquifer, Bottom of aquifer, and Thickness of aquifer. Groundwater Test Data: Test date, Yield/discharge, Length of time associated with yield, Static water-level, Production water-level associated with yield, Drawdown associated with yield. Hydrogeologic Data: Specific Capacity, Transmissivity, Horizontal Conductivity, Vertical Conductivity, Permeability and Storage Coefficient. Ancillary Data: Method of Test Analysis and Data Source Reference. DESCRIPTIONS OF ATTACHED FILES: LMG_HydrogeologicTestDataset_Mar2022.txt: is a tab delimited, UTF-8 text file of the LMG-Hydrogeologic Test Dataset Mar2022. Readme.txt: is a text (.txt) file with field names, definitions and units for the LMG-Hydrogeologic Test Dataset Mar2022 and associated attribute tables. AtbtTbl_AqfrCd_Readme.txt: Is a UTF-8 text file containing information from the National Water Information System: Help System web page about USGS groundwater codes. (accessed December 4, 2019 at https://help.waterdata.usgs.gov/codes-and-parameters) AtbtTbl_FipsGeographyCodes.txt: Is a tab delimited, UTF-8 text file of FIPS (Federal Information Processing Standards) codes, uniquely identifying states, counties and county equivalents in the United States. Note: to reduce the size of this file, City Codes were removed. (accessed January 8, 2020 at https://www.census.gov/geographies/reference-files/2017/demo/popest/2017-fips.html). AtbtTbl_LocalAqfrCodes.txt: Is a tab delimited, UTF-8 text file of eight-character strings identifying aquifers. Codes are defined by the "Catalog of Aquifer Names and Geologic Unit Codes used by the USGS."
(accessed December 4, 2019 at https://help.waterdata.usgs.gov/aqfr_cd) AtbtTbl_NatAqfrCodes.txt: Is a tab delimited, UTF-8 text file of ten-character strings identifying a National aquifer, or principal aquifer of the United States, that are defined as regionally extensive aquifers or aquifer systems that have the potential to be used as a source of potable water. (accessed December 4, 2019 at https://water.usgs.gov/ogw/NatlAqCode-reflist.html) AtbtTbl_TstMthdCodes.txt: Is a tab delimited, UTF-8 text file of codes identifying the test analysis method when reported in the associated reference. AtbtTbl_DataRefNo.txt: Is a tab delimited, UTF-8 text file of references for the source of the associated aquifer test result. CAVEAT: The hydrogeologic test results reported in this dataset have not been through the USGS data review and approval process to receive the Director’s approval. As such, the Mar2022 version of the LMG-Hydrogeologic Test dataset should be considered provisional. Provisional data are released on the condition that neither the USGS nor the United States Government may be held liable for any damages resulting from its use. This dataset was developed as part of the U.S. Geological Survey, Mississippi Alluvial Plain Regional Water-Availability Study.
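The Bradbury and Rothschild (1985) technique mentioned above builds on the Cooper-Jacob approximation, in which transmissivity T appears on both sides of T = (Q / 4πs) · ln(2.25·T·t / (rw²·S)) and must therefore be found iteratively. The sketch below shows only that core fixed-point iteration, omitting the well-loss and partial-penetration corrections of the full method; the function and parameter names are illustrative:

```python
import math

def transmissivity_from_specific_capacity(Q, s, t, rw, S, tol=1e-6, max_iter=100):
    """Iteratively estimate transmissivity T from specific-capacity data
    via the Cooper-Jacob relation T = (Q / (4*pi*s)) * ln(2.25*T*t / (rw**2 * S)).

    Q  : pumping rate (m^3/d)
    s  : drawdown (m)
    t  : pumping duration (d)
    rw : effective well radius (m)
    S  : storage coefficient (dimensionless)
    Returns T in m^2/d.
    """
    C = Q / (4.0 * math.pi * s)
    T = C  # initial guess
    for _ in range(max_iter):
        T_new = C * math.log(2.25 * T * t / (rw ** 2 * S))
        if abs(T_new - T) < tol:
            return T_new
        T = T_new
    return T
```

The iteration converges quickly because each update damps the previous error by roughly the ratio C/T.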
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The DNS exfiltration dataset was recorded in a realistic network environment. More than 50 million DNS requests were recorded on one of the ISP's DNS servers. The data in the dataset were anonymised by changing all IP addresses using an injective mapping. Features in the dataset are split into single-request and aggregate features. Single-request (DNS label-based) features can be calculated for each DNS request independently using only the textual characteristics of the request. On the other hand, aggregate features are calculated using multiple subsequent requests from one client to a particular TLD. This reduces the size of the dataset to about 35 million records. The complete list of features with descriptions can be found in the dataset_description.txt file. For all of the features based on finding English words in the request, we used about 60,000 of the most common English words. The list of used words can be found in english_words.txt. The main dataset (dataset.csv) contains regular requests and exfiltrations performed using the DNSExfiltrator and Iodine tools. An additional dataset (dataset_modified.csv) contains only exfiltrations executed with a modified DNSExfiltrator tool. Waiting times between two consecutive requests in this dataset are randomised, and the requests also have lower entropy, making detection much harder.
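As an illustration of a single-request textual feature of the kind described above (not necessarily one of the dataset's exact features), the Shannon entropy of a DNS label can be computed as:

```python
import math
from collections import Counter

def label_entropy(label):
    """Shannon entropy (bits per character) of a DNS label's characters.
    Exfiltration payloads encoded into labels tend to look random and so
    have higher entropy than ordinary host names."""
    if not label:
        return 0.0
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A repetitive label like "aaaa" scores 0 bits, while a label whose characters are all distinct approaches log2 of its alphabet size, which is why the modified exfiltrator's lower-entropy requests are harder to detect.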
If you use this dataset for your research, please cite: Žiža, K., Tadić, P. & Vuletić, P. DNS exfiltration detection in the presence of adversarial attacks and modified exfiltrator behaviour. Int. J. Inf. Secur. (2023). https://doi.org/10.1007/s10207-023-00723-w
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Our new network dataset is crawled from Douban Movies (https://movie.douban.com), a website providing users' comments on movies. Each node in the network represents a movie, and each edge indicates that the movies at its two ends are co-preferenced by audiences, as provided by Douban. The network contains 31,761 nodes and 179,924 edges. We use the movie profiles to form the attributes of the nodes. First, we use "jieba" (https://github.com/fxsjy/jieba), a widely used Chinese word segmentation tool, to segment movie profiles and filter out common stop words and words that appear fewer than three times in the corpus. Then, we build a TF-IDF vector for each movie using scikit-learn and reduce the dimension to 500 via SVD. We build three downstream tasks for this Douban dataset: movie genre prediction, rating score level prediction, and popularity level prediction. The genre prediction task is a multi-label classification task; we directly use the genres of the movie provided by Douban as the labels, and each movie has at least one genre. To build the labels for rating score prediction, we rank movies by rating score and divide them into 10 classes of equal size. Similarly, we rank all movies by number of comments and divide them into three classes of equal size. For each task, we randomly sample 70% of the nodes as the training set, 10% as the validation set, and use the rest as the test set.
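The equal-size, rank-based labeling described for the rating-score and popularity tasks can be sketched as follows; the helper name is ours, not the authors':

```python
def rank_based_classes(scores, n_classes):
    """Assign each item to one of n_classes (nearly) equal-size classes
    by ranking scores in ascending order, so class 0 holds the lowest
    scores and class n_classes-1 the highest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_classes // len(scores)
    return labels
```

For example, with scores [5, 1, 3] and three classes, the movies receive labels [2, 0, 1]: one movie per class, ordered by score.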
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
This is the dataset presented in the paper The Mountain Habitats Segmentation and Change Detection Dataset accepted for publication in the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Beach, HI, USA, January 6-9, 2015. The full-sized images and masks along with the accompanying files and results can be downloaded here. The size of the dataset is about 2.1 GB.
The dataset is released under the Creative Commons Attribution-Non Commercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/legalcode).
The dataset documentation is hosted on GitHub at the following address: http://github.com/fjean/mhscd-dataset-doc. Direct download links to the latest revision of the documentation are provided below:
PDF format: http://github.com/fjean/mhscd-dataset-doc/raw/master/mhscd-dataset-doc.pdf
Text format: http://github.com/fjean/mhscd-dataset-doc/raw/master/mhscd-dataset-doc.rst
https://choosealicense.com/licenses/unknown/
MIT Environmental Impulse Response Dataset
The audio recordings in this dataset were originally created by the Computational Audition Lab at MIT. The source of the data can be found at: https://mcdermottlab.mit.edu/Reverb/IR_Survey.html. The audio files in the dataset have been resampled to a sampling rate of 16 kHz. This resampling was done to reduce the size of the dataset while making it more suitable for various tasks, including data augmentation. The dataset consists of 271 audio files… See the full description on the dataset page: https://huggingface.co/datasets/davidscripka/MIT_environmental_impulse_responses.
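The resampling to 16 kHz described above could be reproduced with SciPy's polyphase resampler; the sketch below is an assumption about the workflow, not the dataset author's script, and the 48 kHz source rate in the test is hypothetical:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_RATE = 16000

def resample_to_16k(audio, orig_rate):
    """Resample a 1-D audio signal to 16 kHz using polyphase filtering.
    The up/down factors are reduced by their gcd so resample_poly works
    for any integer source rate."""
    g = gcd(orig_rate, TARGET_RATE)
    return resample_poly(audio, TARGET_RATE // g, orig_rate // g)
```

Polyphase resampling applies the anti-aliasing filter during rate conversion, which is why it is a common choice for downsampling impulse responses.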
A tracer breakthrough curve (BTC) for each sampling station is the ultimate goal of every quantitative hydrologic tracing study, and dataset size can critically affect the BTC. Groundwater-tracing data obtained using in situ automatic sampling or detection devices may result in very high-density data sets. Data-dense tracer BTCs obtained using in situ devices and stored in dataloggers can result in visually cluttered, overlapping data points. The relatively large amounts of data detected by high-frequency settings available on in situ devices and stored in dataloggers ensure that important tracer BTC features, such as data peaks, are not missed. However, such dense data sets can also be difficult to interpret. Even more difficult is the application of such dense data sets in solute-transport models, which may not be able to adequately reproduce tracer BTC shapes because of the overwhelming mass of data. One solution to the difficulties associated with analyzing, interpreting, and modeling dense data sets is the selective removal of blocks of data from the total dataset. Although it is possible to skip blocks of tracer BTC data periodically (data decimation) so as to lessen the size and density of the dataset, skipping or deleting blocks of data may also result in missing the important features that the high-frequency detection settings were intended to capture. Rather than removing, reducing, or reformulating overlapping data, signal filtering and smoothing may be utilized, but smoothing errors (e.g., averaging errors, outliers, and potential time shifts) need to be considered. Fitting appropriate probability distributions to tracer BTCs may be used to describe typical tracer BTC shapes, which usually include long tails. Recognizing appropriate probability distributions applicable to tracer BTCs can help in understanding some aspects of the tracer migration. This dataset is associated with the following publications: Field, M.
Tracer-Test Results for the Central Chemical Superfund Site, Hagerstown, Md. May 2014 -- December 2015. U.S. Environmental Protection Agency, Washington, DC, USA, 2017. Field, M. On Tracer Breakthrough Curve Dataset Size, Shape, and Statistical Distribution. ADVANCES IN WATER RESOURCES. Elsevier Science Ltd, New York, NY, USA, 141: 1-19, (2020).
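As a sketch of the data-decimation trade-off discussed above, one can thin a dense BTC while explicitly retaining the concentration peak; this is a simple illustration, not the method used in the cited papers:

```python
import numpy as np

def decimate_keep_peak(t, c, step):
    """Keep every `step`-th (time, concentration) sample but always retain
    the global concentration peak, so periodic decimation cannot erase the
    BTC's most important feature."""
    t = np.asarray(t)
    c = np.asarray(c)
    keep = np.zeros(len(c), dtype=bool)
    keep[::step] = True
    keep[np.argmax(c)] = True  # never drop the peak
    return t[keep], c[keep]
```

Retaining the argmax guards against the failure mode described above, where skipping blocks of data deletes exactly the peak that high-frequency sampling was meant to capture; real BTCs may also need similar protection for secondary peaks and tail points.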