41 datasets found

f
Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...
acs.figshare.com
xlsx
Updated Jun 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s004
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.1c00244.s004
Dataset updated
Jun 9, 2023
Dataset provided by
ACS Publications
Authors
Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing datathey are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
s
Data from: Data files used to study the distribution of growth in software...
figshare.swinburne.edu.au
pdf
Updated Jul 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajesh Vasa (2024). Data files used to study the distribution of growth in software systems [Dataset]. http://doi.org/10.25916/sut.26271970.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25916/sut.26271970.v1
Dataset updated
Jul 22, 2024
Dataset provided by
Swinburne
Authors
Rajesh Vasa
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
The evolution of a software system can be studied in terms of how various properties as reflected by software metrics change over time. Current models of software evolution have allowed for inferences to be drawn about certain attributes of the software system, for instance, regarding the architecture, complexity and its impact on the development effort. However, an inherent limitation of these models is that they do not provide any direct insight into where growth takes place. In particular, we cannot assess the impact of evolution on the underlying distribution of size and complexity among the various classes. Such an analysis is needed in order to answer questions such as 'do developers tend to evenly distribute complexity as systems get bigger?', and 'do large and complex classes get bigger over time?'. These are questions of more than passing interest since by understanding what typical and successful software evolution looks like, we can identify anomalous situations and take action earlier than might otherwise be possible. Information gained from an analysis of the distribution of growth will also show if there are consistent boundaries within which a software design structure exists. The specific research questions that we address in Chapter 5 (Growth Dynamics) of the thesis this data accompanies are: What is the nature of distribution of software size and complexity measures? How does the profile and shape of this distribution change as software systems evolve? Is the rate and nature of change erratic? Do large and complex classes become bigger and more complex as software systems evolve? In our study of metric distributions, we focused on 10 different measures that span a range of size and complexity measures. In order to assess assigned responsibilities we use the two metrics Load Instruction Count and Store Instruction Count. Both metrics provide a measure for the frequency of state changes in data containers within a system. Number of Branches, on the other hand, records all branch instructions and is used to measure the structural complexity at class level. This measure is equivalent to Weighted Method Count (WMC) as proposed by Chidamber and Kemerer (1994) if a weight of 1 is applied for all methods and the complexity measure used is cyclomatic complexity. We use the measures of Fan-Out Count and Type Construction Count to obtain insight into the dynamics of the software systems. The former offers a means to document the degree of delegation, whereas the latter can be used to count the frequency of object instantiations. The remaining metrics provide structural size and complexity measures. In-Degree Count and Out-Degree Count reveal the coupling of classes within a system. These measures are extracted from the type dependency graph that we construct for each analyzed system. The vertices in this graph are classes, whereas the edges are directed links between classes. We associate popularity (i.e., the number of incoming links) with In-Degree Count and usage or delegation (i.e., the number of outgoing links) with Out-Degree Count. Number of Methods, Public Method Count, and Number of Attributes define typical object-oriented size measures and provide insights into the extent of data and functionality encapsulation. The raw metric data (4 .txt files and 1 .log file in a .zip file measuring ~0.5MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
d
PREDIK Data-Driven: Geospatial Data | USA | Tailor-made datasets: Foot...
datarade.ai
Updated Sep 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Predik Data-driven (2024). PREDIK Data-Driven: Geospatial Data | USA | Tailor-made datasets: Foot traffic & Places Data [Dataset]. https://datarade.ai/data-products/predik-data-driven-geospatial-data-usa-tailor-made-datas-predik-data-driven
Explore at:
.json, .csv, .xls, .sqlAvailable download formats
Dataset updated
Sep 30, 2024
Dataset authored and provided by
Predik Data-driven
Area covered
United States
Description
This Location Data & Foot traffic dataset available for all countries include enriched raw mobility data and visitation at POIs to answer questions such as:

-How often do people visit a location? (daily, monthly, absolute, and averages). -What type of places do they visit ? (parks, schools, hospitals, etc) -Which social characteristics do people have in a certain POI? - Breakdown by type: residents, workers, visitors. -What's their mobility like enduring night hours & day hours?
-What's the frequency of the visits partition by day of the week and hour of the day?

Extra insights -Visitors´ relative income Level. -Visitors´ preferences as derived by their visits to shopping, parks, sports facilities, churches, among others.

Overview & Key Concepts Each record corresponds to a ping from a mobile device, at a particular moment in time and at a particular latitude and longitude. We procure this data from reliable technology partners, which obtain it through partnerships with location-aware apps. All the process is compliant with applicable privacy laws.

We clean and process these massive datasets with a number of complex, computer-intensive calculations to make them easier to use in different data science and machine learning applications, especially those related to understanding customer behavior.

Featured attributes of the data Device speed: based on the distance between each observation and the previous one, we estimate the speed at which the device is moving. This is particularly useful to differentiate between vehicles, pedestrians, and stationery observations.

Night base of the device: we calculate the approximated location of where the device spends the night, which is usually their home neighborhood.

Day base of the device: we calculate the most common daylight location during weekdays, which is usually their work location.

Income level: we use the night neighborhood of the device, and intersect it with available socioeconomic data, to infer the device’s income level. Depending on the country, and the availability of good census data, this figure ranges from a relative wealth index to a currency-calculated income.

POI visited: we intersect each observation with a number of POI databases, to estimate check-ins to different locations. POI databases can vary significantly, in scope and depth, between countries.

Category of visited POI: for each observation that can be attributable to a POI, we also include a standardized location category (park, hospital, among others). Coverage: Worldwide.

Delivery schemas We can deliver the data in three different formats:

Full dataset: one record per mobile ping. These datasets are very large, and should only be consumed by experienced teams with large computing budgets.

Visitation stream: one record per attributable visit. This dataset is considerably smaller than the full one but retains most of the more valuable elements in the dataset. This helps understand who visited a specific POI, characterize and understand the consumer's behavior.

Audience profiles: one record per mobile device in a given period of time (usually monthly). All the visitation stream is aggregated by category. This is the most condensed version of the dataset and is very useful to quickly understand the types of consumers in a particular area and to create cohorts of users.
a
Digital Earth Africa's Landsat 8 Annual GeoMAD
deafrica.africageoportal.com
africageoportal.com
+1more
Updated Jan 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Africa GeoPortal (2022). Digital Earth Africa's Landsat 8 Annual GeoMAD [Dataset]. https://deafrica.africageoportal.com/datasets/8ab4222b2a584043a765eca942001337
Explore at:
Dataset updated
Jan 27, 2022
Dataset authored and provided by
Africa GeoPortal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for longer-term time series analysis, cloudless imagery and statistical accuracy.

GeoMAD has two main components: Geomedian and Median Absolute Deviations (MADs).

The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop more complex algorithms.

For each pixel, invalid data is discarded, and remaining observations are mathematically summarised using the geomedian statistic. Flyover coverage provided by collecting data over a period of time also helps scope intermittently cloudy areas.

Variations between the geomedian and the individual measurements are captured by the three Median Absolute Deviation (MAD) layers. These are higher-order statistical measurements calculating variation relative to the geomedian. The MAD layers can be used on their own or together with geomedian to gain insights about the land surface and understand change over time.Key PropertiesGeographic Coverage: Continental Africa - approximately 37° North to 35° SouthTemporal Coverage: 2013 – 2020*Spatial Resolution: 30 x 30 meterUpdate frequency: Annual from 2013 - 2020Product Type: Surface Reflectance (SR)Product Level: Analysis Ready (ARD)Number of Bands: 10 BandsParent Dataset: Landsat Collection 2 Level-2 Surface ReflectanceSource Data Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)Service Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)*Time is enabled on this service using UTC – Coordinated Universal Time. To assure you are seeing the correct year for each annual slice of data, the time zone must be set specifically to UTC in the Map Viewer settings each time this layer is opened in a new map. More information on this setting can be found here: Set the map time zone.ApplicationsGeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for:Longer-term time series analysisCloud-free imageryStatistical accuracyAvailable BandsBand IDDescriptionValue rangeData typeNo data valueSR_B2Geomedian SR_B2 (Blue)1 - 10000uint160SR_B3Geomedian SR_B3 (Green)1 - 10000uint160SR_B4Geomedian SR_B4 (Red)1 - 10000uint160SR_B5Geomedian SR_B5 (NIR)1 - 10000uint160SR_B6Geomedian SR_B6 (SWIR 1)1 - 10000uint160SR_B7Geomedian SR_B7 (SWIR 2)1 - 10000uint160SMADSpectral Median Absolute Deviation0 - 1float32NaNEMADEuclidean Median Absolute Deviation0 - 31623float32NaNBCMADBray-Curtis Median Absolute Deviation0 - 1float32NaNCOUNTNumber of clear observations1 - 65535uint160Bands have been subdivided as follows:Geomedian - 6 bands: The geomedian is calculated using the spectral bands of data collected during the specified time period. Surface reflectance values have been scaled between 1 and 10000 to allow for more efficient data storage as unsigned 16-bit integers (uint16). Note parent datasets often contain more bands, some of which are not used in GeoMAD.Median Absolute Deviations (MADs) - 3 bands: Deviations from the geomedian are quantified through median absolute deviation calculations. The GeoMAD service utilises three MADs, each stored in a separate band: Euclidean MAD (EMAD), spectral MAD (SMAD), and Bray-Curtis MAD (BCMAD). Each MAD is calculated using the same ten bands as in the geomedian. SMAD and BCMAD are normalized ratios, therefore they are unitless and their values always fall between 0 and 1. EMAD is a function of surface reflectance but is neither a ratio nor normalized, therefore its valid value range depends on the number of bands used in the geomedian calculation - ten in GeoMAD.Count - 1 band: The number of clear satellite measurements of a pixel for that calendar year. This is around 20 for Landsat 8 annually, but doubles at areas of overlap between scenes. “Count” is not incorporated in either the geomedian or MADs calculations. It is intended for metadata analysis and data validation.ProcessingAll clear observations for the given time period are collated from the parent dataset. Cloudy pixels are identified and excluded. The geomedian and MADs calculations are then performed by the hdstats package. Annual GeoMAD datasets for the period use hdstats version 0.2.Known LimitationsThe Landsat 8 (& 9) GeoMAD has a known issue with data quality over marine regions. The GeoMAD algorithm uses pixel quality information from the input data to identify and mask pixels with poor quality obervations. Landsat 8 & 9 analysis ready satellite images over the ocean often contain negative surface reflectance values, and the GeoMAD masking procedures remove pixels where any negative values occur. Thus, in regions where pixels are persistently negative throughout the year, the GeoMAD product will contain a no-data value. An example of this can be seen in Image 7 below where a shallow marine system contains no-data values in the GeoMAD because the NIR band values in the input data are persistently negative.More details on this dataset can be found here.
Data from: Bushland ET Calculator
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
data.amerigeoss.org
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Bushland ET Calculator [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/bushland-et-calculator-d94e3
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Area covered
Bushland
Description
The Bushland Reference ET calculator was developed at the USDA-ARS Conservation and Production Research Laboratory, Bushland, Texas. Although it was designed and developed for use mainly by producers and crop consultants to manage irrigation scheduling, it can also be used in educational training, research, and other practical application. It uses the ASCE Standardized Reference Evapotranspiration (ET) Equation for calculating grass and alfalfa reference ET at hourly and daily time steps. This program uses the more complex equation for estimating clear-sky solar radiation provided in Appendix D of the ASCE-EWRI ET Manual. Users have the option of using single set or time series weather data to calculate reference ET. Daily reference ET can be calculated either by summing the hourly ET values for a given day or by using averages of the climatic data. Resources in this dataset:Resource Title: Bushland ET Calculator download page. File Name: Web Page, url: https://res1wwwd-o-tarsd-o-tusdad-o-tgov.vcapture.xyz/research/software/download/?softwareid=Bushland+ET+Calculator&modecode=30-90-05-00
Q
Data for: Debating Algorithmic Fairness
data.qdr.syr.edu
Updated Nov 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Melissa Hamilton; Melissa Hamilton (2023). Data for: Debating Algorithmic Fairness [Dataset]. http://doi.org/10.5064/F6JOQXNF
Explore at:
pdf(53179), pdf(63339), pdf(285052), pdf(103333), application/x-json-hypothesis(55745), pdf(256399), jpeg(101993), pdf(233414), pdf(536400), pdf(786428), pdf(2243113), pdf(109638), pdf(176988), pdf(59204), pdf(124046), pdf(802960), pdf(82120)Available download formats
Unique identifier
https://doi.org/10.5064/F6JOQXNF
Dataset updated
Nov 13, 2023
Dataset provided by
Qualitative Data Repository
Authors
Melissa Hamilton; Melissa Hamilton
License
https://qdr.syr.edu/policies/qdr-standard-access-conditionshttps://qdr.syr.edu/policies/qdr-standard-access-conditions
Time period covered
2008 - 2017
Area covered
United States
Description
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website. Data Generation The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some of Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on github. ProPublica’s gathering the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis. Data Analysis The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors. Several more specific types of discursive strategies were of interest in attracting further critical examination: Testing claims and rationalizations that appear to serve the speaker’s self-interest Examining conclusions and determining whether sufficient evidence supported them Revealing contradictions and/or inconsistencies within the same text and intertextually Assessing strategies underlying justifications and rationalizations used to promote a party’s assertions and arguments Noticing strategic deployment of lexical phrasings, syntax, and rhetoric Judging sincerity of voice and the objective consideration of alternative perspectives Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is to uncover facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet the significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. Then, the availability of the same dataset used by the parties in conflict, made this opportunity more appealing. Calculating additional algorithmic equity equations would not thereby be troubled by irregularities because of diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means. Logic of Annotation Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
E
The Human Know-How Dataset
find.data.gov.scot
dtechtive.com
pdf, zip
Updated Apr 29, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
Explore at:
zip(19.78 MB), zip(0.2837 MB), zip(19.67 MB), zip(69.8 MB), zip(9.433 MB), zip(62.92 MB), zip(20.43 MB), zip(43.28 MB), zip(92.88 MB), zip(13.06 MB), zip(14.86 MB), zip(5.372 MB), zip(0.0298 MB), pdf(0.0582 MB), zip(5.769 MB), zip(90.08 MB)Available download formats
Unique identifier
https://doi.org/10.7488/ds/1394
Dataset updated
Apr 29, 2016
Description
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see: - The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm - The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions ---------------------------------------------------------------- * Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'. * Data representation based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification * Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements. ---------------------------------------------------------------- Statistics: * 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide). * 2,609,236: number of RDF nodes within the instructions From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide). * 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs) * 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs) * 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset provided by
oxylabs, UAB
Authors
Oxylabs
Area covered
Isle of Man, Moldova (Republic of), Taiwan, British Indian Ocean Territory, Andorra, Northern Mariana Islands, Tunisia, Nepal, Bangladesh, Canada
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
InductiveQE Datasets
zenodo.org
zip
Updated Nov 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mikhail Galkin; Mikhail Galkin (2022). InductiveQE Datasets [Dataset]. http://doi.org/10.5281/zenodo.7306046
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7306046
Dataset updated
Nov 9, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mikhail Galkin; Mikhail Galkin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
InductiveQE datasets

UPD 2.0: Regenerated datasets free of potential test set leakages

UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs

This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.

Each dataset is a zip archive containing 17 files:

train_graph.txt (pt for wikikg) - original training graph

val_inference.txt (pt) - inference graph (validation split), new nodes in validation are disjoint with the test inference graph

val_predict.txt (pt) - missing edges in the validation inference graph to be predicted.

test_intference.txt (pt) - inference graph (test splits), new nodes in test are disjoint with the validation inference graph

test_predict.txt (pt) - missing edges in the test inference graph to be predicted.

train/valid/test_queries.pkl - queries of the respective split, 14 query types for fb-derived datasets, 9 types for Wikikg (EPFO-only)

*_answers_easy.pkl - easy answers to respective queries that do not require predicting missing links but only edge traversal

*_answers_hard.pkl - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed

train_answers_val.pkl - the extended set of answers for training queries on the bigger validation graph, most of training queries have at least 1 more new answers. This is supposed to be an inference-only dataset to measure faithfulness of trained models

train_answers_test.pkl - the extended set of answers for training queries on the bigger test graph, most of training queries have at least 1 more new answers. This is supposed to be an inference-only dataset to measure faithfulness of trained models

og_mappings.pkl - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2

stats.txt - a small file with dataset stats

Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.

The Wikikg dataset is supposed to be evaluated in the inference-only regime being pre-trained solely on simple link prediction, the number of training complex queries is not enough for such a large dataset.

Paper pre-print: https://arxiv.org/abs/2210.08008

The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
a
Digital Earth Africa's Sentinel-2 Annual GeoMAD
deafrica.africageoportal.com
uneca.africageoportal.com
+3more
Updated Sep 23, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Africa GeoPortal (2021). Digital Earth Africa's Sentinel-2 Annual GeoMAD [Dataset]. https://deafrica.africageoportal.com/datasets/a1c5888827b34aaa809427e31bbc2673
Explore at:
Dataset updated
Sep 23, 2021
Dataset authored and provided by
Africa GeoPortal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for longer-term time series analysis, cloudless imagery and statistical accuracy.

GeoMAD has two main components: Geomedian and Median Absolute Deviations (MADs)

The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop more complex algorithms.

For each pixel, invalid data is discarded, and remaining observations are mathematically summarised using the geomedian statistic. Flyover coverage provided by collecting data over a period of time also helps scope intermittently cloudy areas.

Variations between the geomedian and the individual measurements are captured by the three Median Absolute Deviation (MAD) layers. These are higher-order statistical measurements calculating variation relative to the geomedian. The MAD layers can be used on their own or together with geomedian to gain insights about the land surface and understand change over time.Key PropertiesGeographic Coverage: Continental Africa - approximately 37° North to 35° SouthTemporal Coverage: 2017 – 2022*Spatial Resolution: 10 x 10 meterUpdate Frequency: Annual from 2017 - 2022Product Type: Surface Reflectance (SR)Product Level: Analysis Ready (ARD)Number of Bands: 14 BandsParent Dataset: Sentinel-2 Level-2A Surface ReflectanceSource Data Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)Service Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)*Time is enabled on this service using UTC – Coordinated Universal Time. To assure you are seeing the correct year for each annual slice of data, the time zone must be set specifically to UTC in the Map Viewer settings each time this layer is opened in a new map. More information on this setting can be found here: Set the map time zone.ApplicationsGeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for:Longer-term time series analysisCloud-free imageryStatistical accuracyAvailable BandsBand IDDescriptionValue rangeData typeNo data valueB02Geomedian B02 (Blue)1 - 10000uint160B03Geomedian B03 (Green)1 - 10000uint160B04Geomedian B04 (Red)1 - 10000uint160B05Geomedian B05 (Red edge 1)1 - 10000uint160B06Geomedian B06 (Red edge 2)1 - 10000uint160B07Geomedian B07 (Red edge 3)1 - 10000uint160B08Geomedian B08 (Near infrared (NIR) 1)1 - 10000uint160B8AGeomedian B8A (NIR 2)1 - 10000uint160B11Geomedian B11 (Short-wave infrared (SWIR) 1)1 - 10000uint160B12Geomedian B12 (SWIR 2)1 - 10000uint160SMADSpectral Median Absolute Deviation0 - 1float32NaNEMADEuclidean Median Absolute Deviation0 - 31623float32NaNBCMADBray-Curtis Median Absolute Deviation0 - 1float32NaNCOUNTNumber of clear observations1 - 65535uint160Bands can be subdivided as follows:

Geomedian — 10 bands: The geomedian is calculated using the spectral bands of data collected during the specified time period. Surface reflectance values have been scaled between 1 and 10000 to allow for more efficient data storage as unsigned 16-bit integers (uint16). Note parent datasets often contain more bands, some of which are not used in GeoMAD. The geomedian band IDs correspond to bands in the parent Sentinel-2 Level-2A data. For example, the Annual GeoMAD band B02 contains the annual geomedian of the Sentinel-2 B02 band. Median Absolute Deviations (MADs) — 3 bands: Deviations from the geomedian are quantified through median absolute deviation calculations. The GeoMAD service utilises three MADs, each stored in a separate band: Euclidean MAD (EMAD), spectral MAD (SMAD), and Bray-Curtis MAD (BCMAD). Each MAD is calculated using the same ten bands as in the geomedian. SMAD and BCMAD are normalised ratios, therefore they are unitless and their values always fall between 0 and 1. EMAD is a function of surface reflectance but is neither a ratio nor normalised, therefore its valid value range depends on the number of bands used in the geomedian calculation.Count — 1 band: The number of clear satellite measurements of a pixel for that calendar year. This is around 60 annually, but doubles at areas of overlap between scenes. “Count” is not incorporated in either the geomedian or MADs calculations. It is intended for metadata analysis and data validation.ProcessingAll clear observations for the given time period are collated from the parent dataset. Cloudy pixels are identified and excluded. The geomedian and MADs calculations are then performed by the hdstats package. Annual GeoMAD datasets for the period use hdstats version 0.2.More details on this dataset can be found here.
m
USA Mobility & Foot traffic Enriched Data by Predik Data-Driven
app.mobito.io
Updated Feb 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). USA Mobility & Foot traffic Enriched Data by Predik Data-Driven [Dataset]. https://app.mobito.io/data-product/usa-mobility-&-foot-traffic-enriched-data-by-predik-data-driven
Explore at:
Dataset updated
Feb 3, 2023
Area covered
United States
Description
This Mobility & Foot traffic dataset includes enriched mobility data and visitation at POIs to answer questions such as: -How often do people visit a location? (daily, monthly, absolute, and averages). -What type of places do they visit? (parks, schools, hospitals, etc) -Which social characteristics do people have in a certain POI? - Breakdown by type: residents, workers, visitors. -What's their mobility like during night hours & day hours?
-What's the frequency of the visits by day of the week and hour of the day? Extra insights -Visitors´ relative Income Level. -Visitors´ preferences as derived from their visits to shopping, parks, sports facilities, and churches, among others. - Footfall measurement in all types of establishments (shopping malls, stand-alone stores, etc). -Visitors´ preferences as derived from their visits to shopping, parks, sports facilities, and churches, among others. - Origin/Destiny matrix. - Vehicular traffic, measurement of speed, types of vehicles, among other insights. Overview & Key Concepts Each record corresponds to a ping from a mobile device, at a particular moment in time, and at a particular lat and long. We procure this data from reliable technology partners, which obtain it through partnerships with location-aware apps. All the process is compliant with applicable privacy laws. We clean, process and enrich these massive datasets with a number of complex, computer-intensive calculations to make them easier to use in different tailor-made solutions for companies and also data science and machine learning applications, especially those related to understanding customer behavior. Featured attributes of the data Device speed: based on the distance between each observation and the previous one, we estimate the speed at which the device is moving. This is particularly useful to differentiate between vehicles, pedestrians, and stationery observations. Night base of the device: we calculate the approximate location of where the device spends the night, which is usually its home neighborhood. Day base of the device: we calculate the most common daylight location during weekdays, which is usually their work location. Income level: we use the night neighborhood of the device, and intersect it with available socioeconomic data, to infer the device’s income level. Depending on the country, and the availability of good census data, this figure ranges from a relative wealth index to a currency-calculated income. POI visited: we intersect each observation with a number of POI databases, to estimate check-ins to different locations. POI databases can vary significantly, in scope and depth, between countries. Category of visited POI: for each observation that can be attributable to a POI, we also include a standardized location category (park, hospital, among others). Delivery schemas We can deliver the data in three different formats: Full dataset: one record per mobile ping. These datasets are very large, and should only be consumed by experienced teams with large computing budgets. Visitation stream: one record per attributable visit. This dataset is considerably smaller than the full one but retains most of the more valuable elements in the dataset. This helps understand who visited a specific POI, and characterize and understand the consumer's behavior. Audience profiles: one record per mobile device in a given period of time (usually monthly). All the visitation stream is aggregated by category. This is the most condensed version of the dataset and is very useful to quickly understand the types of consumers in a particular area and to create cohorts of users.
Demographic and Health Survey 2012 - Indonesia
microdata.worldbank.org
dev.ihsn.org
+1more
Updated Jun 2, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statistics Indonesia (BPS) (2017). Demographic and Health Survey 2012 - Indonesia [Dataset]. https://microdata.worldbank.org/index.php/catalog/1637
Explore at:
Dataset updated
Jun 2, 2017
Dataset provided by
Statistics Indonesiahttp://www.bps.go.id/
Authors
Statistics Indonesia (BPS)
Time period covered
2012
Area covered
Indonesia
Description
Abstract

The primary objective of the 2012 Indonesia Demographic and Health Survey (IDHS) is to provide policymakers and program managers with national- and provincial-level data on representative samples of all women age 15-49 and currently-married men age 15-54.

The 2012 IDHS was specifically designed to meet the following objectives: • Provide data on fertility, family planning, maternal and child health, adult mortality (including maternal mortality), and awareness of AIDS/STIs to program managers, policymakers, and researchers to help them evaluate and improve existing programs; • Measure trends in fertility and contraceptive prevalence rates, and analyze factors that affect such changes, such as marital status and patterns, residence, education, breastfeeding habits, and knowledge, use, and availability of contraception; • Evaluate the achievement of goals previously set by national health programs, with special focus on maternal and child health; • Assess married men’s knowledge of utilization of health services for their family’s health, as well as participation in the health care of their families; • Participate in creating an international database that allows cross-country comparisons that can be used by the program managers, policymakers, and researchers in the areas of family planning, fertility, and health in general

Geographic coverage

National coverage

Analysis unit

Household

Women age 15-49

Ever married men age 15-54

Never married men age 15-24

Kind of data

Sample survey data [ssd]

Sampling procedure

Indonesia is divided into 33 provinces. Each province is subdivided into districts (regency in areas mostly rural and municipality in urban areas). Districts are subdivided into subdistricts, and each subdistrict is divided into villages. The entire village is classified as urban or rural.

The 2012 IDHS sample is aimed at providing reliable estimates of key characteristics for women age 15-49 and currently-married men age 15-54 in Indonesia as a whole, in urban and rural areas, and in each of the 33 provinces included in the survey. To achieve this objective, a total of 1,840 census blocks (CBs)-874 in urban areas and 966 in rural areas-were selected from the list of CBs in the selected primary sampling units formed during the 2010 population census.

Because the sample was designed to provide reliable indicators for each province, the number of CBs in each province was not allocated in proportion to the population of the province or its urban-rural classification. Therefore, a final weighing adjustment procedure was done to obtain estimates for all domains. A minimum of 43 CBs per province was imposed in the 2012 IDHS design.

Refer to Appendix B in the final report for details of sample design and implementation.

Mode of data collection

Face-to-face [f2f]

Research instrument

The 2012 IDHS used four questionnaires: the Household Questionnaire, the Woman’s Questionnaire, the Currently Married Man’s Questionnaire, and the Never-Married Man’s Questionnaire. Because of the change in survey coverage from ever-married women age 15-49 in the 2007 IDHS to all women age 15-49 in the 2012 IDHS, the Woman’s Questionnaire now has questions for never-married women age 15-24. These questions were part of the 2007 Indonesia Young Adult Reproductive Survey questionnaire.

The Household and Woman’s Questionnaires are largely based on standard DHS phase VI questionnaires (March 2011 version). The model questionnaires were adapted for use in Indonesia. Not all questions in the DHS model were adopted in the IDHS. In addition, the response categories were modified to reflect the local situation.

The Household Questionnaire was used to list all the usual members and visitors who spent the previous night in the selected households. Basic information collected on each person listed includes age, sex, education, marital status, education, and relationship to the head of the household. Information on characteristics of the housing unit, such as the source of drinking water, type of toilet facilities, construction materials used for the floor, roof, and outer walls of the house, and ownership of various durable goods were also recorded in the Household Questionnaire. These items reflect the household’s socioeconomic status and are used to calculate the household wealth index. The main purpose of the Household Questionnaire was to identify women and men who were eligible for an individual interview.

The Woman’s Questionnaire was used to collect information from all women age 15-49. These women were asked questions on the following topics: • Background characteristics (marital status, education, media exposure, etc.) • Reproductive history and fertility preferences • Knowledge and use of family planning methods • Antenatal, delivery, and postnatal care • Breastfeeding and infant and young children feeding practices • Childhood mortality • Vaccinations and childhood illnesses • Marriage and sexual activity • Fertility preferences • Woman’s work and husband’s background characteristics • Awareness and behavior regarding HIV-AIDS and other sexually transmitted infections (STIs) • Sibling mortality, including maternal mortality • Other health issues

Questions asked to never-married women age 15-24 addressed the following: • Additional background characteristics • Knowledge of the human reproduction system • Attitudes toward marriage and children • Role of family, school, the community, and exposure to mass media • Use of tobacco, alcohol, and drugs • Dating and sexual activity

The Man’s Questionnaire was administered to all currently married men age 15-54 living in every third household in the 2012 IDHS sample. This questionnaire includes much of the same information included in the Woman’s Questionnaire, but is shorter because it did not contain questions on reproductive history or maternal and child health. Instead, men were asked about their knowledge of and participation in health-careseeking practices for their children.

The questionnaire for never-married men age 15-24 includes the same questions asked to nevermarried women age 15-24.

Cleaning operations

All completed questionnaires, along with the control forms, were returned to the BPS central office in Jakarta for data processing. The questionnaires were logged and edited, and all open-ended questions were coded. Responses were entered in the computer twice for verification, and they were corrected for computeridentified errors. Data processing activities were carried out by a team of 58 data entry operators, 42 data editors, 14 secondary data editors, and 14 data entry supervisors. A computer package program called Census and Survey Processing System (CSPro), which was specifically designed to process DHS-type survey data, was used in the processing of the 2012 IDHS.

Response rate

The response rates for both the household and individual interviews in the 2012 IDHS are high. A total of 46,024 households were selected in the sample, of which 44,302 were occupied. Of these households, 43,852 were successfully interviewed, yielding a household response rate of 99 percent.

Refer to Table 1.2 in the final report for more detailed summarized results of the of the 2012 IDHS fieldwork for both the household and individual interviews, by urban-rural residence.

Sampling error estimates

The estimates from a sample survey are affected by two types of errors: (1) nonsampling errors, and (2) sampling errors. Nonsampling errors are the results of mistakes made in implementing data collection and data processing, such as failure to locate and interview the correct household, misunderstanding of the questions on the part of either the interviewer or the respondent, and data entry errors. Although numerous efforts were made during the implementation of the 2012 Indonesia Demographic and Health Survey (2012 IDHS) to minimize this type of error, nonsampling errors are impossible to avoid and difficult to evaluate statistically.

Sampling errors, on the other hand, can be evaluated statistically. The sample of respondents selected in the 2012 IDHS is only one of many samples that could have been selected from the same population, using the same design and identical size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling error is a measure of the variability between all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.

A sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95 percent of all possible samples of identical size and design.

If the sample of respondents had been selected as a simple random sample, it would have been possible to use straightforward formulas for calculating sampling errors. However, the 2012 IDHS sample is the result of a multi-stage stratified design, and, consequently, it was necessary to use more complex formulae. The computer software used to calculate sampling errors for the 2012 IDHS is a SAS program. This program used the Taylor linearization method
m
LATAM Mobility & Foot traffic Enriched Data by Predik Data-Driven
app.mobito.io
Updated Feb 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). LATAM Mobility & Foot traffic Enriched Data by Predik Data-Driven [Dataset]. https://app.mobito.io/data-product/latam-mobility-&-foot-traffic-enriched-data-by-predik-data-driven
Explore at:
Dataset updated
Feb 6, 2023
Area covered
Latin America, SOUTH_AMERICA, NORTH_AMERICA
Description
This Mobility & Foot traffic dataset includes enriched mobility data and visitation at POIs to answer questions such as: -How often do people visit a location? (daily, monthly, absolute, and averages). -What type of places do they visit? (parks, schools, hospitals, etc). -Which social characteristics do people have in a certain POI? - Breakdown by type: residents, workers, visitors. -What's their mobility like during night hours & day hours?
-What's the frequency of the visits by day of the week and hour of the day? Extra insights -Visitors´ relative Income Level. - Footfall measurement in all types of establishments (shopping malls, stand-alone stores, etc). -Visitors´ preferences as derived from their visits to shopping, parks, sports facilities, and churches, among others. - Origin/Destiny matrix. - Vehicular traffic, measurement of speed, types of vehicles, among other insights. Overview & Key Concepts Each record corresponds to a ping from a mobile device, at a particular moment in time, and at a particular lat and long. We procure this data from reliable technology partners, which obtain it through partnerships with location-aware apps. All the process is compliant with GDPR and all applicable privacy laws. We clean, process, and enrich these massive datasets with a number of complex, computer-intensive calculations to make them easier to use in different tailor-made solutions for companies and also data science and machine learning applications, especially those related to understanding customer behavior. Featured attributes of the data Device speed: based on the distance between each observation and the previous one, we estimate the speed at which the device is moving. This is particularly useful to differentiate between vehicles, pedestrians, and stationery observations. Night base of the device: we calculate the approximate location of where the device spends the night, which is usually its home neighborhood. Day base of the device: we calculate the most common daylight location during weekdays, which is usually their work location. Income level: we use the night neighborhood of the device, and intersect it with available socioeconomic data, to infer the device’s income level. Depending on the country, and the availability of good census data, this figure ranges from a relative wealth index to a currency-calculated income. POI visited: we intersect each observation with a number of POI databases, to estimate check-ins to different locations. POI databases can vary significantly, in scope and depth, between countries. Category of visited POI: for each observation that can be attributable to a POI, we also include a standardized location category (park, hospital, among others). Delivery schemas We can deliver the data in three different formats: Full dataset: one record per mobile ping. These datasets are very large, and should only be consumed by experienced teams with large computing budgets. Visitation stream: one record per attributable visit. This dataset is considerably smaller than the full one but retains most of the more valuable elements in the dataset. This helps understand who visited a specific POI, and characterize and understand the consumer's behavior. Audience profiles: one record per mobile device in a given period of time (usually monthly). All the visitation stream is aggregated by category. This is the most condensed version of the dataset and is very useful to quickly understand the types of consumers in a particular area and to create cohorts of users.
H
Social Security Administration Public Use Microdata Files (SSAPUMF)
dataverse.harvard.edu
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Social Security Administration Public Use Microdata Files (SSAPUMF) [Dataset]. http://doi.org/10.7910/DVN/ZDNCVZ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ZDNCVZ
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the social security administration public use microdata files (ssapumf) with r the social security administration (ssa) must be overflowing with quiet heroes, because their public-use microdata files are as inconspicuous as they are thorough. sure, ssa publishes enough great statistical research of their own that outside researchers rarely find ourselves wanting more and fin er data that this agency can provide, but does that stop them from releasing detailed microdata as well? why no. no it does not. if you wake up one morning with a hankerin' to study the person-level lifetime cash-flows of fdr's legacy, roll up your sleeves and start right here. compared to the other data sets on asdfree.com, the social security administr ation public use microdata files (ssapumf) are as straightforward as it gets. you won't find complex sample survey data here, so just review the short-and-to-the-point data descriptions then calculate your statistics the way you would with other non-survey data. each of these files contain either one record per person or one record per person per year, and effortlessly generalize to the entire population of either social security number holders (most of the country) or social security recipients (just beneficiaries). the one-percent samples should be multiplied by 100 to get accurate nationwide count statistics an d the five-percent samples by 20, but ykta (my new urban dictionary entry). this new github repository contains one script: download all microdata.R download each zipped file directly onto your local computer load each file into a data.frame using a mixture of both fancery and schmantzery reproduce the overall count statistics provided in each respective data dictionary save each file as an R data file (.rda) for ultra-fast future use click here to view this lonely script for more detail about the social security administration public use microdata files (ssapumf), visit: < ul> the social security administration home page the social security administration open data initiative the national archives' history of social security notes: i skipped importing these n ew beneficiary data system (nbds) files because i broadly distrust data older than i am and you probably want these easy-to-use, far more current files anyway. confidential to sas, spss, stata, and sudaan users: no doubt they were very impressive when they originally became available. but so was the bone flute. time to transition to r. :D
w
Population and Family Health Survey 2023 - Jordan
microdata.worldbank.org
catalog.ihsn.org
Updated Aug 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Statistics (DoS) (2024). Population and Family Health Survey 2023 - Jordan [Dataset]. https://microdata.worldbank.org/index.php/catalog/6288
Explore at:
Dataset updated
Aug 23, 2024
Dataset authored and provided by
Department of Statistics (DoS)
Time period covered
2023
Area covered
Jordan
Description
Abstract

The 2023 Jordan Population and Family Health Survey (JPFHS) is the eighth Population and Family Health Survey conducted in Jordan, following those conducted in 1990, 1997, 2002, 2007, 2009, 2012, and 2017–18. It was implemented by the Department of Statistics (DoS) at the request of the Ministry of Health (MoH).

The primary objective of the 2023 JPFHS is to provide up-to-date estimates of key demographic and health indicators. Specifically, the 2023 JPFHS: • Collected data at the national level that allowed calculation of key demographic indicators • Explored the direct and indirect factors that determine levels of and trends in fertility and childhood mortality • Measured contraceptive knowledge and practice • Collected data on key aspects of family health, including immunisation coverage among children, prevalence and treatment of diarrhoea and other diseases among children under age 5, and maternity care indicators such as antenatal visits and assistance at delivery • Obtained data on child feeding practices, including breastfeeding, and conducted anthropometric measurements to assess the nutritional status of children under age 5 and women age 15–49 • Conducted haemoglobin testing with eligible children age 6–59 months and women age 15–49 to gather information on the prevalence of anaemia • Collected data on women’s and men’s knowledge and attitudes regarding sexually transmitted infections and HIV/AIDS • Obtained data on women’s experience of emotional, physical, and sexual violence • Gathered data on disability among household members

The information collected through the 2023 JPFHS is intended to assist policymakers and programme managers in evaluating and designing programmes and strategies for improving the health of the country’s population. The survey also provides indicators relevant to the Sustainable Development Goals (SDGs) for Jordan.

Geographic coverage

National coverage

Analysis unit

Household

Individual

Children age 0-5

Woman age 15-49

Man age 15-59

Universe

The survey covered all de jure household members (usual residents), all women aged 15-49, men aged 15-59, and all children aged 0-4 resident in the household.

Kind of data

Sample survey data [ssd]

Sampling procedure

The sampling frame used for the 2023 JPFHS was the 2015 Jordan Population and Housing Census (JPHC) frame. The survey was designed to produce representative results for the country as a whole, for urban and rural areas separately, for each of the country’s 12 governorates, and for four nationality domains: the Jordanian population, the Syrian population living in refugee camps, the Syrian population living outside of camps, and the population of other nationalities. Each of the 12 governorates is subdivided into districts, each district into subdistricts, each subdistrict into localities, and each locality into areas and subareas. In addition to these administrative units, during the 2015 JPHC each subarea was divided into convenient area units called census blocks. An electronic file of a complete list of all of the census blocks is available from DoS. The list contains census information on households, populations, geographical locations, and socioeconomic characteristics of each block. Based on this list, census blocks were regrouped to form a general statistical unit of moderate size, called a cluster, which is widely used in various surveys as the primary sampling unit (PSU). The sample clusters for the 2023 JPFHS were selected from the frame of cluster units provided by the DoS.

The sample for the 2023 JPFHS was a stratified sample selected in two stages from the 2015 census frame. Stratification was achieved by separating each governorate into urban and rural areas. In addition, the Syrian refugee camps in Zarqa and Mafraq each formed a special sampling stratum. In total, 26 sampling strata were constructed. Samples were selected independently in each sampling stratum, through a twostage selection process, according to the sample allocation. Before the sample selection, the sampling frame was sorted by district and subdistrict within each sampling stratum. By using a probability proportional to size selection at the first stage of sampling, an implicit stratification and proportional allocation were achieved at each of the lower administrative levels.

For further details on sample design, see APPENDIX A of the final report.

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Five questionnaires were used for the 2023 JPFHS: (1) the Household Questionnaire, (2) the Woman’s Questionnaire, (3) the Man’s Questionnaire, (4) the Biomarker Questionnaire, and (5) the Fieldworker Questionnaire. The questionnaires, based on The DHS Program’s model questionnaires, were adapted to reflect the population and health issues relevant to Jordan. Input was solicited from various stakeholders representing government ministries and agencies, nongovernmental organisations, and international donors. After all questionnaires were finalised in English, they were translated into Arabic.

Cleaning operations

All electronic data files for the 2023 JPFHS were transferred via SynCloud to the DoS central office in Amman, where they were stored on a password-protected computer. The data processing operation included secondary editing, which required resolution of computer-identified inconsistencies and coding of open-ended questions. Data editing was accomplished using CSPro software. During the duration of fieldwork, tables were generated to check various data quality parameters, and specific feedback was given to the teams to improve performance. Secondary editing and data processing were initiated in July and completed in September 2023.

Response rate

A total of 20,054 households were selected for the sample, of which 19,809 were occupied. Of the occupied households, 19,475 were successfully interviewed, yielding a response rate of 98%.

In the interviewed households, 13,020 eligible women age 15–49 were identified for individual interviews; interviews were completed with 12,595 women, yielding a response rate of 97%. In the subsample of households selected for the male survey, 6,506 men age 15–59 were identified as eligible for individual interviews and 5,873 were successfully interviewed, yielding a response rate of 90%.

Sampling error estimates

The estimates from a sample survey are affected by two types of errors: nonsampling errors and sampling errors. Nonsampling errors are the results of mistakes made in implementing data collection and in data processing, such as failure to locate and interview the correct household, misunderstanding of the questions on the part of either the interviewer or the respondent, and data entry errors. Although numerous efforts were made during the implementation of the 2023 Jordan Population and Family Health Survey (2023 JPFHS) to minimise this type of error, nonsampling errors are impossible to avoid and difficult to evaluate statistically.

Sampling errors, on the other hand, can be evaluated statistically. The sample of respondents selected in the 2023 JPFHS is only one of many samples that could have been selected from the same population, using the same design and sample size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability among all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.

Sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95% of all possible samples of identical size and design.

If the sample of respondents had been selected by simple random sampling, it would have been possible to use straightforward formulas for calculating sampling errors. However, the 2023 JPFHS sample was the result of a multistage stratified design, and, consequently, it was necessary to use more complex formulas. Sampling errors are computed using SAS programs developed by ICF. These programs use the Taylor linearisation method to estimate variances for survey estimates that are means, proportions, or ratios. The Jackknife repeated replication method is used for variance estimation of more complex statistics such as fertility and mortality rates.

A more detailed description of estimates of sampling errors are presented in APPENDIX B of the survey report.

Data appraisal

Data Quality Tables

Household age distribution

Age distribution of eligible and interviewed women

Age distribution of eligible and interviewed men

Age displacement at age 14/15

Age displacement at age 49/50

Pregnancy outcomes by years preceding the survey

Completeness of reporting

Standardization exercise results from anthropometry training

Height and weight data completeness and quality for children

Height measurements from random subsample of measured children

Interference in height and weight measurements of children

Interference in height and weight measurements of women

Heaping in
Death in the United States
kaggle.com
zip
Updated Aug 3, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centers for Disease Control and Prevention (2017). Death in the United States [Dataset]. https://www.kaggle.com/datasets/cdc/mortality
Explore at:
zip(766333584 bytes)Available download formats
Dataset updated
Aug 3, 2017
Dataset authored and provided by
Centers for Disease Control and Preventionhttp://www.cdc.gov/
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Every year the CDC releases the country’s most detailed report on death in the United States under the National Vital Statistics Systems. This mortality dataset is a record of every death in the country for 2005 through 2015, including detailed information about causes of death and the demographic background of the deceased.

It's been said that "statistics are human beings with the tears wiped off." This is especially true with this dataset. Each death record represents somebody's loved one, often connected with a lifetime of memories and sometimes tragically too short.

Putting the sensitive nature of the topic aside, analyzing mortality data is essential to understanding the complex circumstances of death across the country. The US Government uses this data to determine life expectancy and understand how death in the U.S. differs from the rest of the world. Whether you’re looking for macro trends or analyzing unique circumstances, we challenge you to use this dataset to find your own answers to one of life’s great mysteries.

Overview

This dataset is a collection of CSV files each containing one year's worth of data and paired JSON files containing the code mappings, plus an ICD 10 code set. The CSVs were reformatted from their original fixed-width file formats using information extracted from the CDC's PDF manuals using this script. Please note that this process may have introduced errors as the text extracted from the pdf is not a perfect match. If you have any questions or find errors in the preparation process, please leave a note in the forums. We hope to publish additional years of data using this method soon.

A more detailed overview of the data can be found here. You'll find that the fields are consistent within this time window, but some of data codes change every few years. For example, the 113_cause_recode entry 069 only covers ICD codes (I10,I12) in 2005, but by 2015 it covers (I10,I12,I15). When I post data from years prior to 2005, expect some of the fields themselves to change as well.

All data comes from the CDC’s National Vital Statistics Systems, with the exception of the Icd10Code, which are sourced from the World Health Organization.

Project ideas

The CDC's mortality data was the basis of a widely publicized paper, by Anne Case and Nobel prize winner Angus Deaton, arguing that middle-aged whites are dying at elevated rates. One of the criticisms against the paper is that it failed to properly account for the exact ages within the broad bins available through the CDC's WONDER tool. What do these results look like with exact/not-binned age data?

Similarly, how sensitive are the mortality trends being discussed in the news to the choice of bin-widths?

As noted above, the data preparation process could have introduced errors. Can you find any discrepancies compared to the aggregate metrics on WONDER? If so, please let me know in the forums!

WONDER is cited in numerous economics, sociology, and public health research papers. Can you find any papers whose conclusions would be altered if they used the exact data available here rather than binned data from Wonder?

Differences from the first version of the dataset

This version of the dataset was prepared in a completely different many. This has allowed us to provide a much larger volume of data and ensure that codes are available for every field.

We've replaced the batch of sql files with a single JSON per year. Kaggle's platform currently offer's better support for JSON files, and this keeps the number of files manageable.

A tutorial kernel providing a quick introduction to the new format is available here.

Lastly, I apologize if the transition has interrupted anyone's work! If need be, you can still download v1.
🌼 Unveiling the Iris Dataset 🌸
kaggle.com
Updated Jul 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HARISH KUMARdatalab (2023). 🌼 Unveiling the Iris Dataset 🌸 [Dataset]. http://doi.org/10.34740/kaggle/dsv/6209742
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/6209742
Dataset updated
Jul 28, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
HARISH KUMARdatalab
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context: 🌼 The Iris flower dataset, an iconic multivariate set, was first introduced by the renowned British statistician and biologist, Ronald Fisher in 1936 📝. Commonly known as Anderson's Iris dataset, it was curated by Edgar Anderson to measure the morphologic variation of three Iris species 🌸: Iris Setosa, Iris Virginica, and Iris Versicolor.

📊 The set comprises 100 samples from each species, with four features - sepal length, sepal width, petal length, and petal width, measured in centimetres.

🔬 This dataset has since served as a standard test case for various statistical classification techniques in machine learning, including the widely used support vector machines (SVM).

So, whether you're a newbie dipping your toes into the ML pond or a seasoned data scientist testing out a new classification method, the Iris dataset is a classic starting point! 🎯🚀

Columns:

Sepal Length: The length of the sepal of the iris flower, is measured in centimetres.

Sepal Width: The width of the sepal of the iris flower, measured in centimetres.

Petal Length: The length of the petal of the iris flower, is measured in centimetres.

Petal Width: The width of the petal of the iris flower, measured in centimetres.

Species:The specific species of the iris flower, categorized into Sentosa, Virginica, and Versicolor.

Problem Statement:

1.🎯 Classification Challenge: Can you accurately predict the species of an Iris flower based on the four given measurements: sepal length, sepal width, petal length, and petal width?

2.💡 Feature Importance: Which feature (sepal length, sepal width, petal length, or petal width) is the most significant in distinguishing between the species of Iris flowers?

3.📈 Data Scaling: How does standardization (or normalization) of the features affect the performance of your classification models?

4.🧪 Model Experimentation: Can simpler models such as Logistic Regression perform as well as more complex models like Support Vector Machines or Neural Networks on the Iris dataset? Compare the performance of various models.

5.🤖 AutoML Challenge: Use AutoML tools (like Google's AutoML or H2O's AutoML) to build a classification model. How does its performance compare with your handcrafted models?

Kindly, upvote if you find the dataset interesting
Fermi GBM Burst Catalog - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Apr 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Fermi GBM Burst Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/fermi-gbm-burst-catalog
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
When referencing results from this online catalog, please cite von Kienlin, A. et al. 2020, Gruber, D. et al. 2014, von Kienlin, A. et al. 2014, and Bhat, P. et al. 2016. This table lists all of the triggers observed by a subset of the 14 GBM detectors (12 NaI and 2 BGO) which have been classified as gamma-ray bursts (GRBs). Note that there are two Browse catalogs resulting from GBM triggers. All GBM triggers are entered in the Fermi GBM Trigger Catalog, while only those triggers classified as bursts are entered in the Burst Catalog. Thus, a burst will be found in both the Trigger and Burst Catalogs. The Burst Catalog analysis requires human intervention; therefore, GRBs will be entered in the Trigger Catalog before the Burst Catalog. The latency requirements are 1 day for triggers and 3 days for bursts. There are four fewer bursts in the online catalog than in the Gruber et al. 2014 paper. The four missing events (081007224, 091013989, 091022752, and 091208623) have not been classified with certainty as GRBs and are not included in the general GRB catalog. This classification may be revised at a later stage. The GBM consists of an array of 12 sodium iodide (NaI) detectors which cover the lower end of the energy range up to 1 MeV. The GBM triggers off of the rates in the NaI detectors, with some Terrestrial Gamma-ray Flash (TGF)-specific algorithms using the bismuth germanate (BGO) detectors, sensitive to higher energies, up to 40 MeV. The NaI detectors are placed around the Fermi spacecraft with different orientations to provide the required sensitivity and FOV. The cosine-like angular response of the thin NaI detectors is used to localize burst sources by comparing rates from detectors with different viewing angles. The two BGO detectors are placed on opposite sides of the spacecraft so that all sky positions are visible to at least one BGO detector. The signals from all 14 GBM detectors are collected by a central Data Processing Unit (DPU). This unit digitizes and time-tags the detectors' pulse height signals, packages the resulting data into several different types for transmission to the ground (via the Fermi spacecraft), and performs various data processing tasks such as autonomous burst triggering. The GRB science products are transmitted to the FSSC in two types of files. The first file, called the "bcat" file, provides basic burst parameters such as duration, peak flux and fluence, calculated from 8-channel data using a spectral model which has a power-law in energy that falls exponentially above an energy EPeak, known as the Comptonized model. The crude 8-channel binning and the simple spectral model allow data fits in batch mode over numerous time bins in an efficient and robust fashion, including intervals with little or no flux, yielding both values for the burst duration, and deconvolved lightcurves for the detectors included in the fit. The bcat file includes two extensions. The first, containing detailed information about energy channels and detectors used in the calculations, is detector-specific, and includes the time history of the deconvolved flux over the time intervals of the burst. The second shows the evolution of the spectral parameters obtained in a joint fit of the included detectors for the model used, usually the Comptonized model described above. The bcat files and their time-varying quantities contained in these two extensions are available at the HEASARC FTP site. Quantities derived from these batch fits are given in the bcat primary header and presented in the Browse table, as described below. The main purpose of the analysis contained in the bcat file is to produce a measure of the duration of the burst after deconvolving the instrument response. The duration quantities are:
* 't50' - the time taken to accumulate 50% of the burst fluence starting at the 25% fluence level. * 't90' - the time taken to accumulate 90% of the burst fluence starting at the 5% fluence level.
By-products of this analysis include fluxes on various timescales and fluences, both obtained using the simple Comptonized model described above. These quantities are detailed in the Browse table using the following prefixes:
* 'flux' - the peak flux over 3 different timescales obtained in the batch mode fit used to calculate t50/t90. * 'fluence' - the total fluence accumulated in the t50/t90 calculation.
The fluxes and fluences derived from the 8-channel data for these bcat files should be considered less reliable than those in the spectral analysis files described below. Analysis methods used in obtaining these quantities are detailed in the first GBM GRB Catalog (Paciesas et al. 2011). Updates of bcat files will be sent (with new version numbers) as these parameters are refined. This "bcat" file is produced for triggers that are classified as GRBs (with exceptions as described below), and supplements the initial data in the trigger or "tcat" file that is produced for all triggers. The second type of file (the spectrum or "scat" file) provides parameter values and goodness-of-fit measures for different types of spectral fits and models. These fits are performed using 128-channel data, either CSPEC or, for short bursts, TTE data. The type and model are coded into the file name. There are currently two spectrum categories:
* Peak flux ('pflx') - a single spectrum over the time range of the peak flux of the burst * Fluence ('flnc') - a single spectrum over the entire burst duration selected by the duty scientist.
Like the bcat files, the scat files have two extensions. The first extension gives detector-specific information, including photon fluxes and fluences for each detector, which are provided for each energy channel. The second extension provides derived quantities such as flux, fluence and model parameters for the joint fit of all included detectors. The scat files and their energy-resolved quantities contained in these two extensions are available in the Fermi data archive at the HEASARC. Quantities derived from these spectral fits are available in the Browse table, as described below and in Goldstein et al. (2011). The spectra are fit with a number of models, with the signal-to-noise ratio of the spectrum often determining whether a more complex model is statistically favored. The current set is:
* Power law ('plaw'), * Comptonized (exponentially attenuated power law; 'comp') * Band ('band') * Smoothly broken power law ('sbpl')
Warnings The bcat and scat files result from two completely independent analyses, and consequently, it is possible that the same quantities might show differences. Indeed, 1) the fluxes and fluences in the "scat" files should be considered more reliable than those in the "bcat" files, with the official fluxes and fluences being those yielded by the statistically favored model ("Best_Fitting_Model" in the Browse table) and with the full energy resolution of the instrument; 2) in both the bcat and scat analyses, the set of detectors used for the fits ("Scat_Detector_Mask" in the Browse table) may not be the same as the set of detectors that triggered GBM ("Bcat_Detector_Mask" in the Browse table); 3) background definitions are different for the bcat and scat analysis (see References below). Finally, for weak events, it is not always possible to perform duration or spectral analyses, and some bursts occur too close in time to South Atlantic Anomaly entries or exits by Fermi with resultant data truncations that prevent background determinations for the duration analysis. There is not an exact one-to-one correspondence between those events for which the duration analysis fails and those which are too weak to have a useful spectral characterization. This means that in the HEASARC Browse table there are a handful of GRBs which have duration parameters but not spectral fit parameters, and vice versa. In these cases, blank entries in the table indicate missing values where an analysis was not possible. Values of 0.0 for the uncertainties on spectral parameters indicate those parameters have been fixed in the fit from which other parameters or quantities in the table were derived. Missing values for model fit parameters indicate that the fit failed to converge for this model. This is true mostly for the more complicated models (SBPL or BAND) when the fits fail to converge for weaker bursts. Bad spectral fits can often result in unphysical flux and fluence values with undefined errors. We include these bad fits but leave the error fields blank when they contain undefined values. The selection criteria used in the first catalog (Goldstein et al. 2011) for the determination of the best-fit spectral model are different from those in the second catalog (Gruber et al. 2014). The results using the two methods on the sample included in Goldstein et al. (2011) are compared in Gruber et al. (2014). The old catalog files can be retrieved using the HEASARC ftp archive tree, under "previous" directories. The values returned by Browse always come from the "current" directories. The chi-squared statistic was not used in the 2nd catalog, either for parameter optimization or model comparison. The chi-squared values are missing for a few GRBs. This is believed to be because of a known software issue and should not be considered indicative of a bad fit. The variable "scatalog" included in the Browse tables and in the FITS files indicates which catalog a file belongs to, with 2 being the current catalog, and 1 (or absent) the first catalog (preliminary values may appear with value 0). The information in this table is provided by the Fermi
w
Data Analysis and Assessment Center
data.wu.ac.at
Updated Mar 8, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Federal Laboratory Consortium (2017). Data Analysis and Assessment Center [Dataset]. https://data.wu.ac.at/schema/data_gov/N2Q5ZGUyZjktYTg5MC00NDM4LWFmMWEtOWZkNjUxOGJjYTAx
Explore at:
Dataset updated
Mar 8, 2017
Dataset provided by
Federal Laboratory Consortium
Description
Resources for Advanced Data Analysis and VisualizationResearchers who have access to the latest analysis and visualization tools are able to use large amounts of complex data to find efficiencies in projects, designs, and resources. The Data Analysis and Assessment Center (DAAC) at ERDC's Information Technology Laboratory (ITL) provides visualization and analysis tools and support services to enable the analysis of an ever-increasing volume of data.Simplify Data Analysis and Visualization ResearchThe resources provided by the DAAC enable any user to conduct important data analysis and visualization that provides valuable insight into projects and designs and helps to find ways to save resources. The DAAC provides new tools like ezVIZ, and services such as the DAAC website, a rich resource of news about the DAAC, training materials, a community forum and tutorials on how to use data analysis and other issues.The DAAC can perform collaborative work when users prefer to do the work themselves but need help in choosing which visualization program and/or technique and using the visualization tools. The DAAC also carries out custom projects to produce high-quality animations of data, such as movies, which allow researchers to communicate their results to others.Communicate Research in ContextDAAC provides leading animation and modeling software which allows scientists and researchers may communicate all aspects of their research by setting their results in context through conceptual visualization and data analysis.Success StoriesWave Breaking and Associated Droplet and Bubble FormationWave breaking and associated droplet and bubble formation are among the most challenging problems in the field of free-surface hydrodynamics. The method of computational fluid dynamics (CFD) was used to solve this problem numerically for flow about naval vessels. The researchers wanted to animate the time-varying three-dimensional data sets using isosurfaces, but transferring the data back to the local site was a problem because the data sets were large. The DAAC visualization team solved the problem by using EnSight and ezVIZ to generate the isosurfaces, and photorealistic rendering software to produce the images for the animation.Explosive Structure Interaction Effects in Urban TerrainKnown as the Breaching Project, this research studied the effects of high-explosive (HE) charges on brick or reinforced concrete walls. The results of this research will enable the war fighter to breach a wall to enter a building where enemy forces are conducting operations against U.S. interests. Images produced show computed damaged caused by an HE charge on the outer and inner sides of a reinforced concrete wall. The ability to quickly and meaningfully analyze large simulation data sets helps guide further development of new HE package designs and better ways to deploy the HE packages. A large number of designs can be simulated and analyzed to find the best at breaching the wall. The project saves money in greatly reduced field test costs by testing only the designs which were identified in analysis as the best performers.SpecificationsAmethyst, the seven-node Linux visualization cluster housed at the DAAC, is supported by ParaView, EnSight, and ezViz visualization tools and configured as follows:Six computer nodes, each with the following specifications:CPU: 8 dual-core 2.4 Ghz, 64-bit AMD Opteron Processors (16 effective cores)Memory: 128-G RAMVideo: NVidia Quadro 5500 1-GB memoryNetwork: Infiniband Interconnect between nodes, and Gigabit Ethernet to Defense Research and Engineering Network (DREN)One storage node:Disk Space: 20-TB TerraGrid file system, mounted on all nodes as /viz and /work
ACT HTS - 01 Method of Travel (2022)
researchdata.edu.au
Updated Jun 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ACT Government Open Data (2025). ACT HTS - 01 Method of Travel (2022) [Dataset]. https://researchdata.edu.au/act-hts-01-travel-2022/3734137
Explore at:
Dataset updated
Jun 23, 2025
Dataset provided by
Data.govhttps://data.gov/
Authors
ACT Government Open Data
Description
This spreadsheet replicates selected data tables from the ACT & Queanbeyan Household Travel Survey dashboard. Please refer to the attached spreadsheet on this page. About the Method of Travel theme The theme provides an estimate of the number of activities visited on an average weekday by the residents of ACT and Queanbeyan, and the associated methods of travel used. This theme uses trips to describe main method of travel between activities. If multiple modes are used, the one associated with the longest distance leg is selected as the 'main method'. As an example, typical bus travel to a destination would include at least three individual legs: - the distance required to get to the bus stop (before getting in the bus) - the distance travelled on the bus itself - the distance travelling to the final destination (after getting off the bus). These legs are merged together as a single 'bus trip'. Note that the tables provided represent a small subset of data available. Only the number and proportion of trips are shown, by time period and household region. Use of the dashboard or raw survey datasets allow more complex descriptions of travel to be developed. Source of Data: The data shown is not a Census of travel, but a large survey of several thousand households from across the ACT and Queanbeyan. As with any survey there will be some variability in the accuracy of the results, and how well they reflect the movement of the entire population. For instance, if the survey were to be completed on another day, or with a different subset of households, the results would be slightly different. Interpretations of the data should keep this variability in mind: these are estimates of the broad shape of travel only. Even for the same person, travel behaviour will vary according to many factors: day of week, month of year, season, weather, school holidays, illness, family responsibilities, work from home opportunities, etc. Again, by summarising the travel of many different people, the data provides a view of average weekday patterns. In interpreting the data, it is worth noting the following points: - A zero cell does not necessarily mean the travel is never made, but rather that the survey participants did not make this travel on their particular survey day. - Values are rounded, and may not sum to the totals shown. Trip time periods are assigned using the mid point of travel: - AM peak (8am to 9am), PM peak (5pm to 6pm), Interpeak (9am to 5pm), Off-peak (after 6pm) The survey is described on the Transport Canberra and City Services' website: [Household Travel Survey homepage] Cell annotations and notes Some cells have annotations added to them, as follows: * : Statistically significant difference across survey years (at the 95% confidence level). Confidence intervals indicate where the true measure would typically fall if the survey were repeated multiple times (i.e., 95 times out of 100), recognising that each survey iteration may produce slightly different outcomes. ~ : Unreliable estimate (small sample or wide confidence interval) Additional information Analysis by Sift Research, March 2025. Contact research@sift.group for further information. Enclosed data tables shared under a 'CC BY' Creative Commons licence. This enables users to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. [>More information about CC BY]

Facebook

Twitter

Click to copy link

Link copied

Cite

Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s004

Data from: The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE)

Explore at:

xlsxAvailable download formats

Unique identifier

https://doi.org/10.1021/acs.jcim.1c00244.s004

Dataset updated

Jun 9, 2023

Dataset provided by

ACS Publications

Authors

Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing datathey are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.

Clear search

Close search

Google apps

Main menu

Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

Data from: Data files used to study the distribution of growth in software...

PREDIK Data-Driven: Geospatial Data | USA | Tailor-made datasets: Foot...

Digital Earth Africa's Landsat 8 Annual GeoMAD

Data from: Bushland ET Calculator

Data for: Debating Algorithmic Fairness

The Human Know-How Dataset

Company Datasets for Business Profiling

InductiveQE Datasets

Digital Earth Africa's Sentinel-2 Annual GeoMAD

USA Mobility & Foot traffic Enriched Data by Predik Data-Driven

Demographic and Health Survey 2012 - Indonesia

Abstract

Geographic coverage

Analysis unit

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

LATAM Mobility & Foot traffic Enriched Data by Predik Data-Driven

Social Security Administration Public Use Microdata Files (SSAPUMF)

Population and Family Health Survey 2023 - Jordan

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

Data appraisal

Death in the United States

Overview

Project ideas

Differences from the first version of the dataset

🌼 Unveiling the Iris Dataset 🌸

Fermi GBM Burst Catalog - Dataset - NASA Open Data Portal

Data Analysis and Assessment Center

ACT HTS - 01 Method of Travel (2022)

Data from: The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE)