100+ datasets found

Data from: Classification of Mars Terrain Using Multiple Data Sources
data.nasa.gov
datasets.ai
+3more
Updated Mar 31, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Classification of Mars Terrain Using Multiple Data Sources [Dataset]. https://data.nasa.gov/dataset/classification-of-mars-terrain-using-multiple-data-sources
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Classification of Mars Terrain Using Multiple Data Sources Alan Kraut1, David Wettergreen1 ABSTRACT. Images of Mars are being collected faster than they can be analyzed by planetary scientists. Automatic analysis of images would enable more rapid and more consistent image interpretation and could draft geologic maps where none yet exist. In this work we develop a method for incorporating images from multiple instruments to classify Martian terrain into multiple types. Each image is segmented into contiguous groups of similar pixels, called superpixels, with an associated vector of discriminative features. We have developed and tested several classification algorithms to associate a best class to each superpixel. These classifiers are trained using three different manual classifications with between 2 and 6 classes. Automatic classification accuracies of 50 to 80% are achieved in leave-one-out cross-validation across 20 scenes using a multi-class boosting classifier.
u
ICOADS Input Data Sources
ckanprod.ucar.edu
data.ucar.edu
+2more
binary
Updated Aug 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
International Comprehensive Ocean-Atmosphere (ICOADS) Global Partners; Physical Sciences Laboratory, Earth System Research Laboratory, OAR, NOAA, U.S. Department of Commerce (2024). ICOADS Input Data Sources [Dataset]. http://doi.org/10.5065/549Q-3G69
Explore at:
binaryAvailable download formats
Unique identifier
https://doi.org/10.5065/549Q-3G69
Dataset updated
Aug 4, 2024
Dataset provided by
Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory
Authors
International Comprehensive Ocean-Atmosphere (ICOADS) Global Partners; Physical Sciences Laboratory, Earth System Research Laboratory, OAR, NOAA, U.S. Department of Commerce
Time period covered
1784 - 2018
Description
This dataset contains auxiliary, preliminary, and other datasets that are in preparation to be included in a future ICOADS release. Data are provided either in IMMA1 or native (non-IMMA1) format. It also contains datasets in IMMA1 and non-IMMA1 formats that have transitioned into ICOADS. This dataset was created in 2018 in conjunction with the completion of Release 3.0 and efforts going forward - it is not a complete collection of inputs for ICOADS beginning with Release 1. The purpose of this dataset is to provide a common archive point for data exchange with ICOADS researchers and track the provenance as input data sources are added to official releases. These sources are not recommended for general public use. If source data are archived in a different independent RDA dataset, those data are not duplicated in this dataset, but will be referenced with a "Related RDA Dataset" link, e.g. DS285.0 is the World Ocean Database in a non-IMMA1 format provided by NCEI.
f
Data sources.
figshare.com
xls
Updated Aug 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohammad Shahbaz; Jane E. Harding; Barry Milne; Anthony Walters; Martin von Randow; Greg D. Gamble (2024). Data sources. [Dataset]. http://doi.org/10.1371/journal.pone.0308414.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0308414.t001
Dataset updated
Aug 7, 2024
Dataset provided by
PLOS ONE
Authors
Mohammad Shahbaz; Jane E. Harding; Barry Milne; Anthony Walters; Martin von Randow; Greg D. Gamble
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionA combination of self-reported questionnaire and administrative data could potentially enhance ascertainment of outcomes and alleviate the limitations of both in follow up studies. However, it is uncertain how access to only one of these data sources to assess outcomes impact study findings. Therefore, this study aimed to determine whether the study findings would be altered if the outcomes were assessed by different data sources alone or in combination.MethodsAt 50-year follow-up of participants in a randomized trial, we assessed the effect of antenatal betamethasone exposure on the diagnosis of diabetes, pre-diabetes, hyperlipidemia, hypertension, mental health disorders, and asthma using a self-reported questionnaire, administrative data, a combination of both, or any data source, with or without adjudication by an expert panel of five clinicians. Differences between relative risks derived from each data source were calculated using the Bland-Altman approach.ResultsThere were 424 participants (46% of those eligible, aged 49 years, SD 1, 50% male). There were no differences in study outcomes between participants exposed to betamethasone and those exposed to placebo when the outcomes were assessed using different data sources. When compared to the study findings determined using adjudicated outcomes, the mean difference (limits of agreement) in relative risks derived from other data sources were: self-reported questionnaires 0.02 (-0.35 to 0.40), administrative data 0.06 (-0.32 to 0.44), both questionnaire and administrative data 0.01 (-0.41 to 0.43), and any data source, 0.01 (-0.08 to 0.10).ConclusionUtilizing a self-reported questionnaire, administrative data, both questionnaire and administrative data, or any of these sources for assessing study outcomes had no impact on the study findings compared with when study outcomes were assessed using adjudicated outcomes.
w
State of California - Data
data.wu.ac.at
Updated Oct 11, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Global (2013). State of California - Data [Dataset]. https://data.wu.ac.at/odso/datahub_io/NDZlMmFjNWEtMGY1ZS00ZWVhLTgzZWEtMmY5ZmFhMGQyMjEx
Explore at:
Dataset updated
Oct 11, 2013
Dataset provided by
Global
Description
About

Data from the State of California. From website:

Access raw State data files, databases, geographic data, and other data sources. Raw State data files can be reused by citizens and organizations for their own web applications and mashups.

Openness

Open. Effectively in the public domain. Terms of use page says:

In general, information presented on this web site, unless otherwise indicated, is considered in the public domain. It may be distributed or copied as permitted by law. However, the State does make use of copyrighted data (e.g., photographs) which may require additional permissions prior to your use. In order to use any information on this web site not owned or created by the State, you must seek permission directly from the owning (or holding) sources. The State shall have the unlimited right to use for any purpose, free of any charge, all information submitted via this site except those submissions made under separate legal contract. The State shall be free to use, for any purpose, any ideas, concepts, or techniques contained in information provided through this site.
Z
Data from: PANACEA dataset - Heterogeneous COVID-19 Claims
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Jul 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zubiaga, Arkaitz (2022). PANACEA dataset - Heterogeneous COVID-19 Claims [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6493846
Explore at:
Dataset updated
Jul 15, 2022
Dataset provided by
Kochkina, Elena
Procter, Rob
Zubiaga, Arkaitz
He, Yulan
Liakata, Maria
Arana-Catania, Miguel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset.

This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.

The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholar, media), information types (news, claims) and applications (information retrieval, veracity evaluation).

The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).

The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.

The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy.

The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.

The data sources used are:

The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/

CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID

MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID

CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data

TREC Health Misinformation track https://trec-health-misinfo.github.io/

TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html

The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).

The entries in the dataset contain the following information:

Claim. Text of the claim.

Claim label. The labels are: False, and True.

Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.

Original information source. Information about which general information source was used to obtain the claim.

Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.

Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).

References

Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y.. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp,109:109.

Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “is this document relevant?. . . probably” a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.

Limeng Cui and Dongwon Lee. 2020. Coaid: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.

Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. Mm-covid: A multilingual and multimodal data repository for combating covid-19 disinformation.

Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. Trec-covid: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.
d
Transportation Projects in Your Neighborhood
catalog.data.gov
datasets.ai
+4more
Updated Jul 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of New York (2025). Transportation Projects in Your Neighborhood [Dataset]. https://catalog.data.gov/dataset/transportation-projects-in-your-neighborhood
Explore at:
Dataset updated
Jul 19, 2025
Dataset provided by
State of New York
Description
This data set contains DOT construction project information. The data is refreshed nightly from multiple data sources, therefore the data becomes stale rather quickly.
Data underlying CLICS Version 1.0
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Feb 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johann-Mattis List; Johann-Mattis List (2022). Data underlying CLICS Version 1.0 [Dataset]. http://doi.org/10.5281/zenodo.1194088
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1194088
Dataset updated
Feb 21, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Johann-Mattis List; Johann-Mattis List
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the data underlying the original CLICS database. It was assembled from four different sources (see http://clics.lingpy.org for details) and is now deposited here to allow users quick access to the underlying data, since CLICS 1.0 will soon be superseded by a follow-up application based on different data sources.
A
2018 Response to Resistance Subjects Data
data.amerigeoss.org
splitgraph.com
csv, json, rdf, xml
Updated Apr 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
United States (2022). 2018 Response to Resistance Subjects Data [Dataset]. https://data.amerigeoss.org/dataset/2018-response-to-resistance-subjects-data
Explore at:
rdf, xml, json, csvAvailable download formats
Dataset updated
Apr 13, 2022
Dataset provided by
United States
Description
A dataset of APD Response to Resistance subjects which occurred in 2018. AUSTIN POLICE DEPARTMENT DATA DISCLAIMER 1. The data provided is for informational use only and is not considered official APD crime data as in official Texas DPS or FBI crime reports. 2. APD’s crime database is continuously updated, so reports run at different times may produce different results. Care should be taken when comparing against other reports as different data collection methods and different data sources may have been used. 3. The Austin Police Department does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided.

The number of use of force subjects in the city of Austin for 2018. This dataset is used to provide additional insight visualizations on use of force subjects in the city of Austin in 2018.

This dataset supports measure(s) S.D.3 of SD23. https://data.austintexas.gov/stories/s/kx2d-jya7

Data Source: Versadex

Calculation: (S.D.3) N/A

Measure Time Period: Annually (Calendar Year)

Automated: no

Date of last description update: 8/10/2020
D
Data Preparation Tools and Software Market Report | Global Forecast From...
dataintelo.com
csv, pdf, pptx
Updated Sep 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2024). Data Preparation Tools and Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-preparation-tools-and-software-market
Explore at:
csv, pdf, pptxAvailable download formats
Dataset updated
Sep 12, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Preparation Tools and Software Market Outlook

The global data preparation tools and software market size was valued at USD 3.5 billion in 2023 and is projected to reach USD 11.2 billion by 2032, growing at a compound annual growth rate (CAGR) of 13.6% during the forecast period. This impressive growth can be attributed to the increasing need for data-driven decision-making, the rising adoption of big data analytics, and the growing importance of business intelligence across various industries.

One of the key growth factors driving the data preparation tools and software market is the exponential increase in data volume generated by both enterprises and consumers. With the proliferation of IoT devices, social media, and digital transactions, organizations are inundated with vast amounts of data that need to be processed and analyzed efficiently. Data preparation tools help in cleaning, transforming, and structuring this raw data, making it usable for analytics and business intelligence, thereby enabling companies to derive actionable insights and maintain a competitive edge.

Another significant driver for the market is the rising complexity of data sources and types. Organizations today deal with diverse datasets coming from various sources such as relational databases, cloud storage, APIs, and even machine-generated data. Data preparation tools and software provide automated and scalable solutions to handle these complex datasets, ensuring data consistency and accuracy. The tools also facilitate seamless integration with various data sources, enabling organizations to create a unified view of their data landscape, which is crucial for effective decision-making.

The growing adoption of advanced technologies such as AI and machine learning is also boosting the demand for data preparation tools and software. These technologies require high-quality, well-prepared data to function efficiently and generate reliable outcomes. Data preparation tools that incorporate AI capabilities can automate many of the repetitive and time-consuming tasks involved in data cleaning and transformation, thereby improving productivity and reducing human error. This, in turn, accelerates the implementation of AI-driven solutions across different sectors, further propelling market growth.

Regionally, North America currently holds the largest share of the data preparation tools and software market, driven by the presence of leading technology companies and a robust infrastructure for data analytics and business intelligence. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, fueled by rapid digitization, increasing adoption of cloud-based solutions, and significant investments in big data and AI technologies. Europe is also a key market, with growing awareness about data governance and privacy regulations driving the adoption of data preparation tools.

Component Analysis

When analyzing the data preparation tools and software market by component, it is broadly categorized into software and services. The software segment is further divided into standalone data preparation tools and integrated solutions that come as part of larger analytics or business intelligence platforms. Standalone data preparation tools offer specialized functionalities such as data cleaning, transformation, and enrichment, catering to specific data preparation needs. These tools are particularly popular among organizations that require high levels of customization and flexibility in their data preparation processes.

On the other hand, integrated solutions are gaining traction due to their ability to provide end-to-end capabilities, from data preparation to visualization and analytics, all within a single platform. These solutions typically offer seamless integration with other business intelligence tools, enabling users to move from data preparation to analysis without switching between different software. This integrated approach is particularly beneficial for enterprises looking to streamline their data workflows and improve operational efficiency.

The services segment includes professional services such as consulting, implementation, and training, as well as managed services. Professional services are crucial for organizations that lack in-house expertise in data preparation and need external assistance to set up and optimize their data preparation processes. These services help organizations effectively leverage data preparation tools, ensuring that they achieve maximum ROI. Managed services, on the other hand, are
BSVerticalOzone database
zenodo.org
data.niaid.nih.gov
+1more
nc
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Birgit Hassler; Birgit Hassler; Stefanie Kremser; Stefanie Kremser; Greg Bodeker; Greg Bodeker; Jared Lewis; Jared Lewis; Kage Nesbit; Sean Davis; Sandip Dhomse; Sandip Dhomse; Martin Dameris; Martin Dameris; Kage Nesbit; Sean Davis (2020). BSVerticalOzone database [Dataset]. http://doi.org/10.5281/zenodo.1217184
Explore at:
ncAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1217184
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Birgit Hassler; Birgit Hassler; Stefanie Kremser; Stefanie Kremser; Greg Bodeker; Greg Bodeker; Jared Lewis; Jared Lewis; Kage Nesbit; Sean Davis; Sandip Dhomse; Sandip Dhomse; Martin Dameris; Martin Dameris; Kage Nesbit; Sean Davis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An updated and improved version of a global, vertically resolved, monthly mean zonal mean ozone database has been calculated – hereafter referred to as the BSVertOzone database, the BSVertOzone database. Like its predecessor, it combines measurements from several satellite-based instruments and ozone profile measurements from the global ozonesonde network. Monthly mean zonal mean ozone concentrations in mixing ratio and number density are provided in 5 latitude zones, spanning 70 altitude levels (1 to 70km), or 70 pressure 5 levels that are approximately 1km apart (878.4hPa to 0.046hPa). Different data sets or "Tiers" are provided: "Tier 0" is based only on the available measurements and therefore does not completely cover the whole globe or the full vertical range uniformly; the "Tier 0.5" monthly mean zonal means are calculated from a filled version of the Tier 0 database where missing monthly mean zonal mean values are estimated from correlations at level 20 against a total column ozone database and then at levels above and below on correlations with lower and upper levels respectively. The Tier 10 0.5 database includes the full range of measurement variability and is created as an intermediate step for the calculation of the "Tier 1" data where a least squares regression model is used to attribute variability to various known forcing factors for ozone. Regression model fit coefficients are expanded in Fourier series and Legendre polynomials (to account for seasonality and latitudinal structure, respectively). Four different combinations of contributions from selected regression model basis functions result in four different "Tier 1" data set that can be used for comparisons with chemistry-climate model simulations that do not 15 exhibit the same unforced variability as reality (unless they are nudged towards reanalyses). Compared to previous versions of the database, this update includes additional satellite data sources and ozonesonde measurements to extend the database period to 2016. Additional improvements over the previous version of the database include: (i) Adjustments of measurements to account for biases and drifts between different data sources (using a chemistry-transport model simulation as a transfer standard), (ii) a more objective way to determine the optimum number of Fourier and Legendre expansions for the basis 20 function fit coefficients, and (iii) the derivation of methodological and measurement uncertainties on each database value are traced through all data modification steps. Comparisons with the ozone database from SWOOSH (Stratospheric Water and OzOne Satellite Homogenized data set) show excellent agreements in many regions of the globe, and minor differences caused by different bias adjustment procedures for the two databases. However, compared to SWOOSH, BSVertOzone additionally covers the troposphere.
e
Synthetic Administrative Data: Census 1991, 2023 - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Synthetic Administrative Data: Census 1991, 2023 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6f71c471-1b89-5932-b354-700afb58cb5c
Explore at:
Dataset updated
Oct 11, 2024
Description
We create a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin) that mimic the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1033664 individuals.National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data, one of the main sources of error is coverage and representation to the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative to the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data. Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade. This is a synthetic administrative dataset with only 6 variables to enable the calculation of quality indicators in the R package: https://github.com/sook-tusk/qualadmin See also the user manual. The dataset was created from a 1991 synthetic UK census dataset containing over 1 million records by deleting, moving and duplicating records across geographies according to pre-specified proportions within broad ethnic group and gender. The geography variable includes 6 local authorities but they are completely anonymized and labelled 1,2..6. Other variables are (number of categories in parentheses): sex (2), age groups (14), ethnic groups (5) and employment (3). The final size of the synthetic administrative data is 1033664 individuals. The description of the variables are in the data dictionary that is uploaded with the data.
u
Data from: GALLO: An R package for Genomic Annotation and integration of...
portalcientifico.unileon.es
Updated 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fonseca, Pablo, A.S.; Suárez-Vega, Aroa; Marras, Gabriele; Cánovas, Ángela; Fonseca, Pablo, A.S.; Suárez-Vega, Aroa; Marras, Gabriele; Cánovas, Ángela (2020). GALLO: An R package for Genomic Annotation and integration of multiple data source in livestock for positional candidate LOci [Dataset]. https://portalcientifico.unileon.es/documentos/668fc461b9e7c03b01bdb93f
Explore at:
Dataset updated
2020
Authors
Fonseca, Pablo, A.S.; Suárez-Vega, Aroa; Marras, Gabriele; Cánovas, Ángela; Fonseca, Pablo, A.S.; Suárez-Vega, Aroa; Marras, Gabriele; Cánovas, Ángela
Description
The development of high-throughput sequencing and genotyping methodologies allowed the identification of thousands of genomic regions associated with several complex traits. The integration of multiple sources of biological information is a crucial step required to better understand patterns regulating the development of these traits. Genomic Annotation in Livestock for positional candidate LOci (GALLO) is an R package developed for the accurate annotation of genes and quantitative trait loci (QTLs) located in regions identified in common genomic analyses performed in livestock, such as Genome-Wide Association Studies and transcriptomics using RNA-Sequencing. Moreover, GALLO allows the graphical visualization of gene and QTL annotation results, data comparison among different grouping factors (e.g., methods, breeds, tissues, statistical models, studies, etc.), and QTL enrichment in different livestock species including cattle, pigs, sheep, and chickens, etc. Consequently, GALLO is a useful package for the annotation, identification of hidden patterns across datasets, datamining previously reported associations, as well as the efficient scrutinization of the genetic architecture of complex traits in livestock.
A
‘2018 RP Citations’ analyzed by Analyst-2
analyst-2.ai
Updated Dec 15, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2018). ‘2018 RP Citations’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-2018-rp-citations-ef7e/04ac51ff/?iid=001-998&v=presentation
Explore at:
Dataset updated
Dec 15, 2018
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘2018 RP Citations’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/65527fab-0701-44f9-89a0-7fe9d7161472 on 12 February 2022.

--- Dataset description provided by original source is as follows ---

AUSTIN POLICE DEPARTMENT DATA DISCLAIMER 1. The data provided are for informational use only and may differ from official APD crime data. 2. APD’s crime database is continuously updated, so reports run at different times may produce different results. Care should be taken when comparing against other reports as different data collection methods and different data sources may have been used. 3. The Austin Police Department does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided.

--- Original source retains full ownership of the source dataset ---
H
Health Data Interactive
dataverse.harvard.edu
Updated Feb 10, 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harvard Dataverse (2011). Health Data Interactive [Dataset]. http://doi.org/10.7910/DVN/BHUMXV
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/BHUMXV
Dataset updated
Feb 10, 2011
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Health Data Interactive produces a series of tables on national health statistics. Tables are customizable, and users can download tables, charts and reports. Health topics include, and are not limited to: hospital discharges, mental health, disabilities, diabetes and childbirth. Background Health Data Interactive is a part of the National Center for Health Statistics at the Centers for Disease Control and Prevention. From this website, users can get information on a variety of health topics and trends in the United States. Topics are organized under the following categories: health and functional status; health care use and expenditures; health conditions; health insurance and access; mortality and life expectancy; pregnancy and birth; risk factors and disease prevention. User Functionality Users can choose to download the data files or view the results in chart, table or graph form. Since users can control the which variables are included and how they are presented, they have a wide variety of customization options; directions for how to work with the tables are provided. Users can view data by age, race/ethnicity, gender and/ or geographic region. Data Notes Fourteen different data sources are used and are clearly identified for each table. There are links to each source from the website. The tables are updated frequently, but the site does not specify when. The most recent data is from 2009.
H
Scenario Planning - Raw Source Data
dataverse.harvard.edu
datasetcatalog.nlm.nih.gov
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Scenario Planning - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/PXRVDS
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/PXRVDS
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool group 'Scenario Planning', including related concepts like Scenario Analysis and Contingency Planning. The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "scenario planning" + "scenario analysis" + "contingency planning" + "scenario planning business" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Scenario Planning + Scenario Analysis + Contingency Planning + Scenario and Contingency Planning Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("scenario planning" OR "scenario analysis" OR "contingency planning" OR "scenario and contingency planning") AND ("management" OR "strategic" OR "business" OR "planning" OR "implementation" OR "approach" OR "framework") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Scenario Planning (1993, 1999, 2000); Scenario and Contingency Planning (2004, 2006, 2008, 2010, 2012, 2014, 2017); Scenario Analysis and Contingency Planning (2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1999/475; 2000/214; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Scenario Planning (1993, 1999, 2000); Scenario and Contingency Planning (2004, 2006, 2008, 2010, 2012, 2014, 2017); Scenario Analysis and Contingency Planning (2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1999/475; 2000/214; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
A
‘R2R 2010’ analyzed by Analyst-2
analyst-2.ai
Updated Dec 19, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2010). ‘R2R 2010’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-r2r-2010-332d/bd16e1a1/?iid=024-969&v=presentation
Explore at:
Dataset updated
Dec 19, 2010
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘R2R 2010’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/75c7ed76-c7d7-451c-8ed6-36d485c68c68 on 26 January 2022.

--- Dataset description provided by original source is as follows ---

ResponsAUSTIN POLICE DEPARTMENT DATA DISCLAIMER 1. The data provided are for informational use only and may differ from official APD crime data. 2. APD’s crime database is continuously updated, so reports run at different times may produce different results. Care should be taken when comparing against other reports as different data collection methods and different data sources may have been used. 3. The Austin Police Department does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided. e to Resistance dataset for 2010

--- Original source retains full ownership of the source dataset ---
s
Geonames - All Cities with a population > 1000
data.smartidf.services
public.opendatasoft.com
+1more
csv, excel, geojson +1
Updated Mar 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://data.smartidf.services/explore/dataset/geonames-all-cities-with-a-population-1000/
Explore at:
csv, geojson, json, excelAvailable download formats
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of adm div (ca 80.000)Sources and ContributionsSources : GeoNames is aggregating over hundred different data sources. Ambassadors : GeoNames Ambassadors help in many countries. Wiki : A wiki allows to view the data and quickly fix error and add missing places. Donations and Sponsoring : Costs for running GeoNames are covered by donations and sponsoring.Enrichment:add country name
O
Site Plan Cases
data.austintexas.gov
datahub.austintexas.gov
+2more
Updated Aug 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Austin, Texas - data.austintexas.gov (2025). Site Plan Cases [Dataset]. https://data.austintexas.gov/w/mavg-96ck/7r79-5ncn?cur=M_teYOSBDn8
Explore at:
tsv, csv, application/rdfxml, application/geo+json, kmz, application/rssxml, xml, kmlAvailable download formats
Dataset updated
Aug 4, 2025
Dataset authored and provided by
City of Austin, Texas - data.austintexas.gov
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
City of Austin Open Data Terms of Use https://data.austintexas.gov/stories/s/ranj-cccq

This data set contains information about the site plan case applications submitted for review to the City of Austin. The data set includes information about case status in the permit review system, case number, proposed use, applicant, owner, and location.

Development Services DEPARTMENT DATA DISCLAIMER:

The data provided are for informational use only and may differ from official DSD data.

DSD’s database is continuously updated, so reports run at different times may produce different results. Care should be taken when comparing against other reports as different data collection methods and different data sources may have been used.

The Development Services Department does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided.
f
DataSheet1_Ensemble Models of For-Hire Vehicle Trips.zip
figshare.com
zip
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hao Wu; David Levinson (2023). DataSheet1_Ensemble Models of For-Hire Vehicle Trips.zip [Dataset]. http://doi.org/10.3389/ffutr.2022.876880.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/ffutr.2022.876880.s001
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Hao Wu; David Levinson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ensemble forecasting is class of modeling approaches that combines different data sources, models of different types, with different assumptions, and/or pattern recognition methods. By comprehensively pooling information from multiple sources, analyzed with different techniques, ensemble models can be more accurate, and can better account for different sources of real-world uncertainties. The share of for-hire vehicle (FHV) trips increased rapidly in recent years. This paper applies ensemble models to predicting for-hire vehicle (FHV) trips in Chicago and New York City, showing that properly applied ensemble models can improve forecast accuracy beyond the best single model.
g
Development Economics Data Group - Regulatory Quality: Number of Sources |...
gimi9.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Development Economics Data Group - Regulatory Quality: Number of Sources | gimi9.com [Dataset]. https://gimi9.com/dataset/worldbank_wb_wdi_rq_no_src/
Explore at:
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Worldwide Governance Indicators (WGI) are a research dataset summarizing the views on the quality of governance provided by a large number of enterprise, citizen and expert survey respondents in industrial and developing countries. Governance consists of the traditions and institutions by which authority in a country is exercised. This includes the process by which governments are selected, monitored and replaced; the capacity of the government to effectively formulate and implement sound policies; and the respect of citizens and the state for the institutions that govern economic and social interactions among them. Number of sources indicates the number of underlying data sources on which the aggregate estimate is based. The WGI are based on a large number of different data sources, capturing the views and experiences of survey respondents and experts in the public and private sectors, as well as various NGOs. These data sources include: (a) surveys of households and firms (e.g. Afrobarometer surveys, Gallup World Poll, and Global Competitiveness Report survey), (b) NGOs (e.g. Global Integrity, Freedom House, Reporters Without Borders), (c) commercial business information providers (e.g. Economist Intelligence Unit, S&P Global, Political Risk Services), and (d) public sector organizations (e.g. CPIA assessments of World Bank and regional development banks). Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations that permit and promote private sector development.

Facebook

Twitter

Click to copy link

Link copied

Cite

nasa.gov (2025). Classification of Mars Terrain Using Multiple Data Sources [Dataset]. https://data.nasa.gov/dataset/classification-of-mars-terrain-using-multiple-data-sources

Data from: Classification of Mars Terrain Using Multiple Data Sources

Explore at:

Dataset updated

Mar 31, 2025

Dataset provided by

NASAhttp://nasa.gov/

Description

Classification of Mars Terrain Using Multiple Data Sources Alan Kraut1, David Wettergreen1 ABSTRACT. Images of Mars are being collected faster than they can be analyzed by planetary scientists. Automatic analysis of images would enable more rapid and more consistent image interpretation and could draft geologic maps where none yet exist. In this work we develop a method for incorporating images from multiple instruments to classify Martian terrain into multiple types. Each image is segmented into contiguous groups of similar pixels, called superpixels, with an associated vector of discriminative features. We have developed and tested several classification algorithms to associate a best class to each superpixel. These classifiers are trained using three different manual classifications with between 2 and 6 classes. Automatic classification accuracies of 50 to 80% are achieved in leave-one-out cross-validation across 20 scenes using a multi-class boosting classifier.

Clear search

Close search

Google apps

Main menu

Data from: Classification of Mars Terrain Using Multiple Data Sources

ICOADS Input Data Sources

Data sources.

State of California - Data

About

Openness

Data from: PANACEA dataset - Heterogeneous COVID-19 Claims

Transportation Projects in Your Neighborhood

Data underlying CLICS Version 1.0

2018 Response to Resistance Subjects Data

Data Preparation Tools and Software Market Report | Global Forecast From...

Data Preparation Tools and Software Market Outlook

Component Analysis

BSVerticalOzone database

Synthetic Administrative Data: Census 1991, 2023 - Dataset - B2FIND

Data from: GALLO: An R package for Genomic Annotation and integration of...

‘2018 RP Citations’ analyzed by Analyst-2

Health Data Interactive

Scenario Planning - Raw Source Data

‘R2R 2010’ analyzed by Analyst-2

Geonames - All Cities with a population > 1000

Site Plan Cases

Development Services DEPARTMENT DATA DISCLAIMER:

DataSheet1_Ensemble Models of For-Hire Vehicle Trips.zip

Development Economics Data Group - Regulatory Quality: Number of Sources |...

Data from: Classification of Mars Terrain Using Multiple Data SourcesSee More Versions

Data from: Classification of Mars Terrain Using Multiple Data Sources