68 datasets found
  1. Weather and Housing in North America

    • kaggle.com
    zip
    Updated Feb 13, 2023
    Cite
    The Devastator (2023). Weather and Housing in North America [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-and-housing-in-north-america
    Explore at:
    zip (512280 bytes). Available download formats
    Dataset updated
    Feb 13, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    North America
    Description

    Weather and Housing in North America

    Exploring the Relationship between Weather and Housing Conditions in 2012


    About this dataset

    This dataset explores the relationship between housing and weather conditions across North America in 2012. Through climate variables such as temperature, wind speed, humidity, pressure, and visibility, it provides insight into the weather-influenced environment of numerous regions. Housing parameters such as longitude, latitude, median income, median house value, and ocean proximity further show how distinct climates factor into area real estate valuations. Analyzing these two data sets helps clarify which factors can dictate the value and comfort level offered by residential areas throughout North America.


    How to use the dataset

    This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.

    First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.

    Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.

    Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs let you visualize trends such as seasonal variations or long-term changes over time, so they are useful for interpreting large amounts of data quickly while providing context beyond what the numbers alone can tell us about relationships within this dataset.
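
    As a concrete starting point, here is a minimal pandas sketch of the workflow described above (the file name housing.csv and its column names are assumptions based on this description; Weather.csv and its columns are taken from the column list further below):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the two tables (file and column names assumed from the description).
    weather = pd.read_csv("Weather.csv", parse_dates=["Date/Time"])
    housing = pd.read_csv("housing.csv")

    # 1. Descriptive statistics: central tendency and spread of each variable.
    print(weather[["Temp_C", "Wind Speed_km/h", "Rel Hum_%"]].describe())
    print(housing[["median_income", "median_house_value"]].describe())

    # 2. Correlations between numeric housing attributes (join with weather on a
    #    shared key such as region before correlating across the two tables).
    print(housing.corr(numeric_only=True))

    # 3. Visualization: e.g., monthly mean temperature to see seasonal variation.
    weather.set_index("Date/Time")["Temp_C"].resample("M").mean().plot(title="Monthly mean temperature")
    plt.show()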

    Research Ideas

    • Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
    • Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
    • Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize their profits from rentals or Airbnb listings over time.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Weather.csv

    | Column name | Description |
    |:---------------------|:-----------------------------------------------|
    | Date/Time | Date and time of the observation. (Date/Time) |
    | Temp_C | Temperature in Celsius. (Numeric) |
    | Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
    | Rel Hum_% | Relative humidity in percent. (Numeric) |
    | Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
    | Visibility_km | Visibilit... |

  2. Graphite//LFP synthetic training prognosis dataset

    • data.mendeley.com
    Updated May 6, 2020
    + more versions
    Cite
    Matthieu Dubarry (2020). Graphite//LFP synthetic training prognosis dataset [Dataset]. http://doi.org/10.17632/6s6ph9n8zg.1
    Explore at:
    Dataset updated
    May 6, 2020
    Authors
    Matthieu Dubarry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This training dataset was calculated using the mechanistic modeling approach. See the “Benchmark Synthetic Training Data for Artificial Intelligence-based Li-ion Diagnosis and Prognosis“ publication for more details. More details will be added when published. The prognosis dataset was harder to define as there are no limits on how the three degradation modes can evolve. For this proof-of-concept work, we considered eight parameters to scan. For each degradation mode, degradation was chosen to follow equation (1).

    %degradation = a × cycle + (exp(b × cycle) − 1)    (1)

    Considering the three degradation modes, this accounts for six parameters to scan. In addition, two other parameters were added: a delay for the exponential factor for LLI, and a parameter for the reversibility of lithium plating. The delay was introduced to reflect degradation paths where plating cannot be explained by an increase of LAMs or resistance [55]. The chosen parameters and their values are summarized in Table S1 and their evolution is represented in Figure S1. Figure S1(a,b) presents the evolution of parameters p1 to p7. At worst, the cells endured 100% of one of the degradation modes in around 1,500 cycles. Minimal LLI was chosen to be 20% after 3,000 cycles. This guarantees at least 20% capacity loss for all the simulations. For the LAMs, conditions were less restrictive and, after 3,000 cycles, the lowest degradation is 3%. The reversibility factor p8 was calculated with equation (2) when LAM_NE > PT.

    %LLI = %LLI + p8 × (LAM_PE − PT)    (2)

    Where PT was calculated with equation (3) from [60].

    PT = 100 − ((100 − LAM_PE) / (100 × LR_ini − LAM_PE)) × (100 − OFS_ini − LLI)    (3)
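
    As a quick illustration, equation (1) above can be evaluated directly; the parameter values a and b below are placeholders, not the values from Table S1:

    import numpy as np

    def degradation_percent(cycle, a, b):
        # Equation (1): %degradation = a * cycle + (exp(b * cycle) - 1)
        return a * cycle + (np.exp(b * cycle) - 1)

    cycles = np.arange(0, 3001, 100)
    print(degradation_percent(cycles, a=0.01, b=0.001))  # placeholder parameters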

    Varying all those parameters accounted for more than 130,000 individual duty cycles, with one voltage curve for every 100 cycles. Six MATLAB© .mat files are included. The GIC-LFP_duty_other.mat file contains 12 variables:

    Qnorm: normalized capacity scale for all voltage curves

    p1 to p8: values used to generate the duty cycles

    Key: index indicating which values were used for each degradation path (1 - p1, …, 8 - p8)

    QL: capacity loss, one line per path, one column per 100 cycles.

    File GIC-LFP_duty_LLI-LAMsvalues.mat contains the values for LLI, LAMPE and LAMNE for all cycles (1 line per 100 cycles) and duty cycles (columns).

    Files GIC-LFP_duty_1 to _4 contain the voltage data split into 1 GB chunks (40,000 simulations). Each cell corresponds to one line in the Key variable. Inside each cell, one column per 100 cycles.
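
    A minimal sketch of reading these files from Python (variable and file names follow the description above; scipy.io.loadmat is one common reader, and h5py would be needed instead if the files are stored in MATLAB's v7.3/HDF5 format):

    from scipy.io import loadmat

    other = loadmat("GIC-LFP_duty_other.mat")
    qnorm = other["Qnorm"]  # normalized capacity scale for all voltage curves
    ql = other["QL"]        # capacity loss: one line per path, one column per 100 cycles
    key = other["Key"]      # which p1..p8 values were used for each degradation path

    chunk1 = loadmat("GIC-LFP_duty_1.mat")  # first 1 GB chunk of voltage data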

  3. PUBG_Dataset

    • kaggle.com
    zip
    Updated May 30, 2021
    Cite
    Mohsin Raza (2021). PUBG_Dataset [Dataset]. https://www.kaggle.com/razamh/pubg-dataset
    Explore at:
    zip (67630553 bytes). Available download formats
    Dataset updated
    May 30, 2021
    Authors
    Mohsin Raza
    License

    https://www.usa.gov/government-works/

    Description

    PUBG Data Description

    In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves. You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

    File descriptions: data.csv - 151MB

    Data fields:

    • DBNOs - Number of enemy players knocked.
    • assists - Number of enemy players this player damaged that were killed by teammates.
    • boosts - Number of boost items used.
    • damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
    • headshotKills - Number of enemy players killed with headshots.
    • heals - Number of healing items used.
    • Id - Player's Id.
    • killPlace - Ranking in match of number of enemy players killed.
    • killPoints - Kills-based external ranking of players. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a "None".
    • killStreaks - Max number of enemy players killed in a short amount of time.
    • kills - Number of enemy players killed.
    • longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
    • matchDuration - Duration of match in seconds.
    • matchId - ID to identify matches. There are no matches that are in both the training and testing set.
    • matchType - String identifying the game mode that the data comes from. The standard modes are "solo", "duo", "squad", "solo-fpp", "duo-fpp", and "squad-fpp"; other modes are from events or custom matches.
    • rankPoints - Elo-like ranking of players. This ranking is inconsistent and is being deprecated in the API's next version, so use with caution. Value of -1 takes the place of "None".
    • revives - Number of times this player revived teammates.
    • rideDistance - Total distance traveled in vehicles measured in meters.
    • roadKills - Number of kills while in a vehicle.
    • swimDistance - Total distance traveled by swimming measured in meters.
    • teamKills - Number of times this player killed a teammate.
    • vehicleDestroys - Number of vehicles destroyed.
    • walkDistance - Total distance traveled on foot measured in meters.
    • weaponsAcquired - Number of weapons picked up.
    • winPoints - Win-based external ranking of players. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a "None".
    • groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
    • numGroups - Number of groups we have data for in the match.
    • maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
    • winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
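
    For example, the rankPoints/killPoints/winPoints convention above can be applied when loading the data; a minimal pandas sketch (column names from the field list, logic as described):

    import pandas as pd

    df = pd.read_csv("data.csv")

    # If rankPoints holds a real value (anything other than -1), a 0 in
    # killPoints or winPoints should be treated as "None".
    has_rank = df["rankPoints"] != -1
    for col in ("killPoints", "winPoints"):
        df.loc[has_rank & (df[col] == 0), col] = pd.NA

    # rankPoints itself uses -1 as a stand-in for "None".
    df.loc[df["rankPoints"] == -1, "rankPoints"] = pd.NA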

    REFERENCE: PUBG Finish Placement Prediction (Kernels Only)

  4. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Max Delbrück Center
    German Cancer Research Center
    Max Delbrück Center for Molecular Medicine
    Howard Hughes Medical Institute - Janelia Research Campus
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

    Open a zarr file with:

    import zarr
    raw = zarr.open(<zarr_file>, mode='r', path="volumes/raw")
    seg = zarr.open(<zarr_file>, mode='r', path="volumes/gt_instances")

    optional:
    import numpy as np
    raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
    gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
      title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
      author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
      year = 2024,
      eprint = {2404.00130},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  5. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
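
    The R script itself is distributed with the study materials; the following is only a rough Python sketch of the two-stage logic described above, assuming a sampling frame with columns stratum, ea_id and hh_id:

    import pandas as pd

    HH_PER_EA = 25
    TOTAL_HH = 8000
    N_EAS = TOTAL_HH // HH_PER_EA  # 320 enumeration areas in total

    frame = pd.read_csv("sampling_frame.csv")  # hypothetical frame file

    # Stage 1: allocate enumeration areas to strata proportionally to stratum size.
    stratum_sizes = frame.groupby("stratum")["hh_id"].count()
    ea_alloc = (stratum_sizes / stratum_sizes.sum() * N_EAS).round().astype(int)

    sampled = []
    for stratum, n_eas in ea_alloc.items():
        eas = frame.loc[frame["stratum"] == stratum, "ea_id"].drop_duplicates()
        chosen_eas = eas.sample(n=n_eas, random_state=42)
        # Stage 2: select 25 households at random within each chosen enumeration area.
        for ea in chosen_eas:
            sampled.append(frame[frame["ea_id"] == ea].sample(n=HH_PER_EA, random_state=42))

    sample = pd.concat(sampled, ignore_index=True)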

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  6. A geometric shape regularity effect in the human brain: fMRI dataset

    • openneuro.org
    Updated Mar 14, 2025
    + more versions
    Cite
    Mathias Sablé-Meyer; Lucas Benjamin; Cassandra Potier Watkins; Chenxi He; Maxence Pajot; Théo Morfoisse; Fosca Al Roumi; Stanislas Dehaene (2025). A geometric shape regularity effect in the human brain: fMRI dataset [Dataset]. http://doi.org/10.18112/openneuro.ds006010.v1.0.1
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Mathias Sablé-Meyer; Lucas Benjamin; Cassandra Potier Watkins; Chenxi He; Maxence Pajot; Théo Morfoisse; Fosca Al Roumi; Stanislas Dehaene
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A geometric shape regularity effect in the human brain: fMRI dataset

    Authors:

    • Mathias Sablé-Meyer*
    • Lucas Benjamin
    • Cassandra Potier Watkins
    • Chenxi He
    • Maxence Pajot
    • Théo Morfoisse
    • Fosca Al Roumi
    • Stanislas Dehaene

    *Corresponding author: mathias.sable-meyer@ucl.ac.uk

    Abstract

    The perception and production of regular geometric shapes is a characteristic trait of human cultures since prehistory, whose neural mechanisms are unknown. Behavioral studies suggest that humans are attuned to discrete regularities such as symmetries and parallelism, and rely on their combinations to encode regular geometric shapes in a compressed form. To identify the relevant brain systems and their dynamics, we collected functional MRI and magnetoencephalography data in both adults and six-year-olds during the perception of simple shapes such as hexagons, triangles and quadrilaterals. The results revealed that geometric shapes, relative to other visual categories, induce a hypoactivation of ventral visual areas and an overactivation of the intraparietal and inferior temporal regions also involved in mathematical processing, whose activation is modulated by geometric regularity. While convolutional neural networks captured the early visual activity evoked by geometric shapes, they failed to account for subsequent dorsal parietal and prefrontal signals, which could only be captured by discrete geometric features or by more advanced transformer models of vision. We propose that the perception of abstract geometric regularities engages an additional symbolic mode of visual perception.

    Notes about this dataset

    We separately share the MEG dataset at https://openneuro.org/datasets/ds006012. Below are some notes about the fMRI dataset of N=20 adult participants (sub-2xx, numbers between 204 and 223), and N=22 children (sub-3xx, numbers between 301 and 325).

    • The code for the analyses is provided at https://github.com/mathias-sm/AGeometricShapeRegularityEffectHumanBrain
      However, the analyses work from already preprocessed data. Since there is no custom code per se for the preprocessing, I have not included it in the repository. To preprocess the data as was done in the published article, here is the command and software information:
      • fMRIPrep version: 20.0.5
      • fMRIPrep command: /usr/local/miniconda/bin/fmriprep /data /out participant --participant-label <label> --output-spaces MNI152NLin6Asym:res-2 MNI152NLin2009cAsym:res-2
    • Defacing has been performed with bidsonym running the pydeface masking, and nobrainer brain registration pipeline.
      The published analyses have been performed on the non-defaced data. I have checked for data quality on all participants after defacing. In specific cases, I may be able to request the permission to share the original, non-defaced dataset.
    • sub-325 was acquired by a different experimenter and defaced before being shared with the rest of the research team, hence why the slightly different defacing mask. That participant was also preprocessed separately, and using a more recent fMRIPrep version: 20.2.6.
    • The data associated with the children has a few missing files. Notably:
      1. sub-313 and sub-316 are missing one run of the localizer each
      2. sub-316 has no data at all for the geometry
      3. sub-308 has no usable data for the intruder task

      Since all of these still have some data to contribute to either task, all available files were kept in this dataset. The analysis code reflects these inconsistencies where required with specific exceptions.

  7. Osu! Standard Rankings

    • kaggle.com
    zip
    Updated Jan 30, 2023
    + more versions
    Cite
    Julliane Pierre (2023). Osu! Standard Rankings [Dataset]. https://www.kaggle.com/datasets/jullianepierre/osu-standard-rankings/data
    Explore at:
    zip (3788 bytes). Available download formats
    Dataset updated
    Jan 30, 2023
    Authors
    Julliane Pierre
    Description

    Context:

    osu! is a music rhythm game with four modes. This dataset contains the rankings of the standard mode, captured on 30/01/2023 around 3 PM. The ranking is based on pp (performance points) awarded after every play, which are influenced by play accuracy and score; pp values are then summed with weights: your top play awards its full pp, and each subsequent play is weighted by a decreasing percentage (this helps balance strong players against players who simply play a lot). Many other per-player statistics are also included.
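
    The description does not give the exact weighting, but osu!'s commonly documented scheme weights the i-th best play by 0.95^i; a small sketch under that assumption:

    def total_pp(play_pp, decay=0.95):
        # Sum per-play pp with geometrically decreasing weights (best play counts fully).
        ordered = sorted(play_pp, reverse=True)
        return sum(pp * decay ** i for i, pp in enumerate(ordered))

    print(total_pp([400, 390, 380]))  # the top play counts fully, later plays count less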

    Contents:

    The dataset contains the columns listed below, reporting statistics for every player in the top 100 of the standard mode. The ranking is ordered by pp. Some players appear to have the same points, but the decimals are not shown in the ranking chart on the site.

    Variables:

    • rank: global rank (you can use this like an id too)
    • player_name: in-game nickname
    • country: country of origin
    • accuracy: mean accuracy of your top plays
    • play_count: lifetime plays
    • level: level (not very influential on stats)
    • hours: total hours played
    • performance_points: pp which determine the rankings
    • ss: number of ss plays (accuracy=100% and no miss)
    • s: number of s plays (accuracy>=93% and no miss)
    • a: number of a plays (accuracy>=93% but there are misses)
    • watched_by: number of replays of the player watched by others

    Acknowledgements:

    I created this dataset for my upcoming Data Science project.

    I used the 2017 osu! rankings and description by Svidon as a reference in order to produce the 2023 osu! top-100 ranking as of January 30, 2023.

    The data is public and accessible at https://osu.ppy.sh/rankings/osu/performance.

    Here is his kaggle: https://www.kaggle.com/svidon

  8. Used Car Listings in Indonesia

    • kaggle.com
    zip
    Updated Oct 23, 2023
    Cite
    Indra (2023). Used Car Listings in Indonesia [Dataset]. https://www.kaggle.com/datasets/indraputra21/used-car-listings-in-indonesia/code
    Explore at:
    zip (26021 bytes). Available download formats
    Dataset updated
    Oct 23, 2023
    Authors
    Indra
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Indonesia
    Description

    Dataset Description:

    This dataset contains information about various listings of used cars, their attributes, and features, including their brand, year of manufacture, price, installment amount, mileage, transmission type, location, license plate type, and various features such as rear camera, sunroof, auto retract mirror, and more.

    Column Descriptions:

    1. car name: The name or model of the car.
    2. brand: The brand or manufacturer of the car.
    3. year: The year the car was manufactured.
    4. mileage (km): The mileage or distance traveled by the car in kilometers (km).
    5. location: The location where the car is listed for sale.
    6. transmission: The transmission type, such as "Manual" or "Automatic."
    7. plate type: The type of license plate, which can be an even plate or an odd plate.
    8. rear camera: Indicates whether the car has a rear camera (0 for no, 1 for yes).
    9. sun roof: Indicates whether the car has a sunroof (0 for no, 1 for yes).
    10. auto retract mirror: Indicates whether the car has auto-retracting mirrors (0 for no, 1 for yes).
    11. electric parking brake: Indicates whether the car has an electric parking brake (0 for no, 1 for yes).
    12. map navigator: Indicates whether the car has a built-in map navigator (0 for no, 1 for yes).
    13. vehicle stability control: Indicates whether the car has vehicle stability control (0 for no, 1 for yes).
    14. keyless push start: Indicates whether the car has a keyless push start (0 for no, 1 for yes).
    15. sports mode: Indicates whether the car has a sports mode (0 for no, 1 for yes).
    16. 360 camera view: Indicates whether the car has a 360-degree camera view (0 for no, 1 for yes).
    17. power sliding door: Indicates whether the car has a power sliding door (0 for no, 1 for yes).
    18. auto cruise control: Indicates whether the car has auto cruise control (0 for no, 1 for yes).
    19. price (Rp): The price of the car in Indonesian Rupiah (Rp).
    20. instalment (Rp|Monthly): The monthly installment amount for the car, in Indonesian Rupiah (Rp).

    Potential Usages:

    This dataset can be used for used-car market analysis, price prediction, and similar tasks.
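
    A minimal sketch of the price-prediction use case (the file name and exact column labels are assumptions based on the column list above):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("used_car_listings.csv")  # assumed file name

    # A few numeric and binary features from the column list; adjust labels as needed.
    features = ["year", "mileage (km)", "rear camera", "sun roof", "keyless push start"]
    X = df[features]
    y = df["price (Rp)"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out listings:", model.score(X_test, y_test))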

    Other:

    Raw data is provided for anyone who wants it (in Bahasa Indonesia).

    data source: scraped from https://www.carsome.id/

    image: generated using DALL·E 3

  9. DaDaDa

    • kaggle.com
    zip
    Updated May 2, 2025
    Cite
    nawei_zhw (2025). DaDaDa [Dataset]. https://www.kaggle.com/datasets/naweizhw/dadada
    Explore at:
    zip (8034170 bytes). Available download formats
    Dataset updated
    May 2, 2025
    Authors
    nawei_zhw
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    DaDaDa: A Dataset for Data Products in Data Marketplaces

    Composition of DaDaDa

    DaDaDa contains metadata for 16,147 data products collected from 9 major data marketplaces. The features comprising DaDaDa are detailed below.

    • title. The title or short description of the data product.
    • url. The web address of the detail page of the data product.
    • platform. The name of the data marketplace hosting the data product.
    • provider. The name of the data provider as made available by the data marketplace. There are a total of 1,992 data providers, with “Techsalerator” being the leading provider, offering 644 data products.
    • description. The detailed description of the data product.
    • volume. The number of records.
    • size. The data size (in bytes) provided by the data product.
    • dimension. The number of data features.
    • coverage. The countries covered by the data product.
    • update_frequency. The frequency between data product updates as announced by the seller, such as “monthly”, “daily”, and “real-time”. Most data products adopt “no-update” and “daily”.
    • data_sample. The filename of the data sample if available. We download and store the data sample of data products in an additional folder.
    • category. The original category of data product may vary across different data marketplaces, each with its own way of categorization. We align the data categories from other marketplaces with the AWS Marketplace categories through manual labeling.
    • price_mode. The pricing mode of the data product. There are five pricing modes: (1) negotiation mode where data buyers need to negotiate the price with data providers, (2) free mode where the data is provided at no cost, (3) subscription mode where data buyers are charged a recurring fee on a monthly or annual basis, (4) one-off mode where data buyers pay a one-time fee to access the data permanently, and (5) usage-based mode where data buyers are charged based on the amount of data they consume, such as the volume of data downloaded or the number of API calls.
    • price. The price in USD ($). If the pricing mode is free or negotiation, the price is set to 0. If the pricing mode is subscription, the price represents the subscription cost for 12 months; if the pricing mode is usage-based, the price reflects the cost for a single usage. (See the sketch after this list for one way to normalize prices across modes.)
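
    Because the meaning of the price field depends on price_mode, cross-product comparisons usually need a normalization step. A rough sketch (field names from the list above; the exact string values of price_mode in the data are assumptions):

    import pandas as pd

    df = pd.read_csv("dadada_products.csv")  # assumed export of the metadata

    def monthly_price_usd(row):
        # Very rough monthly-equivalent price following the rules described above.
        if row["price_mode"] in ("free", "negotiation"):
            return 0.0
        if row["price_mode"] == "subscription":
            return row["price"] / 12.0  # listed price covers 12 months
        return row["price"]             # one-off or usage-based: no clean monthly split

    df["monthly_price_usd"] = df.apply(monthly_price_usd, axis=1)
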
  10. Uniform Sentinel 1-2 Dataset

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2025
    Cite
    Shamba Chowdhury (2025). Uniform Sentinel 1-2 Dataset [Dataset]. https://www.kaggle.com/datasets/shambac/uniform-sentinel-1-2-dataset
    Explore at:
    zip (25713003558 bytes). Available download formats
    Dataset updated
    Jun 9, 2025
    Authors
    Shamba Chowdhury
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset name

    UNIFORM SEN1-2

    Year of publication

    2025

    Author

    Shamba Chowdhury, Ankana Ghosh, Shreyashi Ghosh

    License

    CC-BY-SA-4.0. The dataset contains Copernicus data (2024). Terms and conditions apply: https://scihub.copernicus.eu/twiki/pub/SciHubWebPortal/TermsConditions/TC_Sentinel_Data_31072014.pdf

    Associated publication

    TBA

    Links

    Dataset: https://www.kaggle.com/datasets/shambac/uniform-sentinel-1-2-dataset
    Paper: TBA

    Dataset structure

    • Folders named in the format of 'r_XXX' and CSV files named in the format of 'data_r_XXX.csv'.
    • Each folder contains two sub folders named 's1_XXX' and 's2_XXX'.
    • s1 folder contains 256x256 grayscale Sentinel 1 images from a particular region and s2 folder contains 256x256 color Sentinel 2 images from the same region.
    • Each region folder has an accompanying data CSV (a minimal loading sketch follows this list).
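
    A minimal sketch of walking this layout (the image file extension and the assumption that paired Sentinel-1/Sentinel-2 patches share a file name are not stated above and may need adjusting):

    from pathlib import Path
    import pandas as pd
    from PIL import Image

    root = Path("uniform-sen1-2")  # assumed extraction directory

    for region_dir in sorted(root.glob("r_*")):
        region_id = region_dir.name.split("_", 1)[1]           # e.g. "001" from "r_001"
        meta = pd.read_csv(region_dir / f"data_{region_dir.name}.csv")
        s1_dir = region_dir / f"s1_{region_id}"
        s2_dir = region_dir / f"s2_{region_id}"
        for s1_path in sorted(s1_dir.glob("*.png")):
            s2_path = s2_dir / s1_path.name
            s1_img = Image.open(s1_path)  # 256x256 grayscale Sentinel-1 patch
            s2_img = Image.open(s2_path)  # 256x256 colour Sentinel-2 patch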

    Dataset size

    No. of files: 616,148; Storage: 53,699 MB

    Description

    The dataset has images spread uniformly across the world, with 165 regions and 129,438 pairs of images. Thus the total number of image files in the dataset amounts to 258,876. An overview of the selected regions on the world map is given in the figure below.

    [Figure: overview of the selected regions on the world map]

    The information in the CSV files is metadata for all the images:

    • Coordinates: Geo-coordinates of the top-left point of the image.
    • Country: Name of the country where the image was captured.
    • Date-Time: Date and time when the image was captured.
    • Resolution Scale: Geospatial resolution of the image.
    • Temperature Region: Temperature zone of the region in the image.
    • Season: Season in the specific region at the time the image was captured.

    Sentinel 1 images have two more attributes:

    • Operational Mode: The operational/acquisition mode of the satellite used to capture the given image.
    • Polarisation: The polarisation with which the image was captured.

    Sentinel 2 images have one unique attribute:

    • Bands: Sentinel 2 images come with multiple different information channels called bands; this attribute contains a list of the bands in the image.

    A grid of sample images from the dataset is given below:

    [Figure: grid of sample images from the dataset]

  11. FAD: A Chinese Dataset for Fake Audio Detection

    • dataon.kisti.re.kr
    • data.niaid.nih.gov
    • +1more
    Updated Jun 9, 2022
    Cite
    Haoxin Ma;Jiangyan Yi (2022). FAD: A Chinese Dataset for Fake Audio Detection [Dataset]. https://dataon.kisti.re.kr/search/view.do?mode=view&svcId=de34c2d5f0649d30185d71299b5ef977
    Explore at:
    Dataset updated
    Jun 9, 2022
    Authors
    Haoxin Ma;Jiangyan Yi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fake audio detection is a growing concern and some relevant datasets have been designed for research. But there is no standard public Chinese dataset under additive noise conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate fake audios. To simulate real-life scenarios, three noise datasets are selected for noise adding at five different signal-to-noise ratios. The FAD dataset can be used not only for fake audio detection, but also for detecting the algorithms of fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods with generalization remain challenging. The FAD dataset is publicly available. The source code of the baselines is available on GitHub: https://github.com/ADDchallenge/FAD

    The FAD dataset is designed to evaluate methods of fake audio detection, fake algorithm recognition and other relevant studies. To better study the robustness of the methods under noisy conditions when applied in real life, we construct the corresponding noisy dataset. The total FAD dataset consists of two versions: a clean version and a noisy version. Both versions are divided into disjoint training, development and test sets in the same way. There is no speaker overlap across these three subsets. Each test set is further divided into seen and unseen test sets. Unseen test sets can evaluate the generalization of the methods to unknown types. It is worth mentioning that both real audios and fake audios in the unseen test set are unknown to the model. For the noisy speech part, we select three noise databases for simulation. Additive noises are added to each audio in the clean dataset at 5 different SNRs. The additive noises of the unseen test set and the remaining subsets come from different noise databases. In each version of the FAD dataset, there are 138,400 utterances in the training set, 14,400 utterances in the development set, 42,000 utterances in the seen test set, and 21,000 utterances in the unseen test set. More detailed statistics are demonstrated in Table 2.

    Clean Real Audios Collection

    To eliminate the interference of irrelevant factors, we collect clean real audios from two sources: 5 open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.

    Clean Fake Audios Generation

    We select 11 representative speech synthesis methods to generate the fake audios, plus one method producing partially fake audios.

    Noisy Audios Simulation

    Noisy audios aim to quantify the robustness of the methods under noisy conditions. To simulate real-life scenarios, we artificially sample the noise signals and add them to clean audios at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. Additive noises are selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes. This data set is licensed with a CC BY-NC-ND 4.0 license.
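
    To illustrate the noisy-set construction described above, here is a minimal sketch of mixing a noise signal into a clean utterance at a target SNR (function and variable names are hypothetical; waveforms are assumed to be float NumPy arrays):

    import numpy as np

    def add_noise_at_snr(clean, noise, snr_db):
        # Tile or trim the noise so it matches the clean signal length.
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[:len(clean)]
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    # e.g. build the five noisy versions described above:
    # noisy = {snr: add_noise_at_snr(clean, noise, snr) for snr in (0, 5, 10, 15, 20)}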
    You can cite the data using the following BibTeX entry:
    @inproceedings{ma2022fad,
    title={FAD: A Chinese Dataset for Fake Audio Detection},
    author={Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xunrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Le Xu, Ruibo Fu},
    booktitle={Submitted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks },
    year={2022},
    }

  12. Estimated stand-off distance between ADS-B equipped aircraft and obstacles

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Jul 12, 2024
    Cite
    Weinert, Andrew (2024). Estimated stand-off distance between ADS-B equipped aircraft and obstacles [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7741272
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    MIT Lincoln Laboratory
    Authors
    Weinert, Andrew
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Summary:

    Estimated stand-off distance between ADS-B equipped aircraft and obstacles. Obstacle information was sourced from the FAA Digital Obstacle File and the FHWA National Bridge Inventory. Aircraft tracks were sourced from processed data curated from the OpenSky Network. Results are presented as histograms organized by aircraft type and distance away from runways.

    Description:

    For many aviation safety studies, aircraft behavior is represented using encounter models, which are statistical models of how aircraft behave during close encounters. They are used to provide a realistic representation of the range of encounter flight dynamics where an aircraft collision avoidance system would be likely to alert. These models currently are, and historically have been, limited to interactions between aircraft; they have not represented the specific interactions between obstacles and transponder-equipped aircraft. In response, we calculated the standoff distance between obstacles and ADS-B equipped manned aircraft.

    For robustness, MIT LL calculated the standoff distance using two different datasets of manned aircraft tracks and two datasets of obstacles. This approach aligned with the foundational research used to support the ASTM F3442/F3442M-20 well clear criteria of 2000 feet laterally and 250 feet AGL vertically.

    The two datasets of processed tracks of ADS-B equipped aircraft were curated from the OpenSky Network. It is likely that rotorcraft were underrepresented in these datasets. There were also no considerations for aircraft equipped only with Mode C or not equipped with any transponders. The first dataset was used to train the v1.3 uncorrelated encounter models and is referred to as the "Monday" dataset. The second dataset is referred to as the "aerodrome" dataset and was used to train the v2.0 and v3.x terminal encounter model. The Monday dataset consisted of 104 Mondays across North America. The other dataset was based on observations at least 8 nautical miles within Class B, C, D aerodromes in the United States for the first 14 days of each month from January 2019 through February 2020. Prior to any processing, the datasets required 714 and 847 gigabytes of storage. For more details on these datasets, please refer to "Correlated Bayesian Model of Aircraft Encounters in the Terminal Area Given a Straight Takeoff or Landing" and "Benchmarking the Processing of Aircraft Tracks with Triples Mode and Self-Scheduling."

    Two different datasets of obstacles were also considered. The first was point obstacles defined by the FAA digital obstacle file (DOF) and consisted of point obstacle structures of antenna, lighthouse, meteorological tower (met), monument, sign, silo, spire (steeple), stack (chimney; industrial smokestack), transmission line tower (t-l tower), tank (water; fuel), tramway, utility pole (telephone pole, or pole of similar height, supporting wires), windmill (wind turbine), and windsock. Each obstacle was represented by a cylinder with the height reported by the DOF and a radius based on the reported horizontal accuracy. We did not consider the actual width and height of the structure itself. Additionally, we only considered obstacles at least 50 feet tall and marked as verified in the DOF.

    The other obstacle dataset, termed "bridges," was based on the identified bridges in the FAA DOF and additional information provided by the National Bridge Inventory. Due to the potential size and extent of bridges, it would not be appropriate to model them as point obstacles; however, the FAA DOF only provides a point location and no information about the size of the bridge. In response, we correlated the FAA DOF with the National Bridge Inventory, which provides information about the length of many bridges. Instead of sizing the simulated bridge based on horizontal accuracy, like with the point obstacles, the bridges were represented as circles with a radius of the longest, nearest bridge from the NBI. A circle representation was required because neither the FAA DOF nor the NBI provided sufficient information about orientation to represent bridges as a rectangular cuboid. Similar to the point obstacles, the height of the obstacle was based on the height reported by the FAA DOF. Accordingly, the analysis using the bridge dataset should be viewed as risk averse and conservative. It is possible that a manned aircraft was hundreds of feet away from an obstacle in actuality but the estimated standoff distance could be significantly less. Additionally, all obstacles are represented with a fixed height; the potentially flat and low level entrances of the bridge are assumed to have the same height as the tall bridge towers. The attached figure illustrates an example simulated bridge.

    It would have been extremely computationally inefficient to calculate the standoff distance for all possible track points. Instead, we define an encounter between an aircraft and obstacle as when an aircraft flying 3069 feet AGL or less comes within 3000 feet laterally of any obstacle in a 60 second time interval. If the criteria were satisfied, then for that 60 second track segment we calculate the standoff distance to all nearby obstacles. Vertical separation was based on the MSL altitude of the track and the maximum MSL height of an obstacle.
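
    A rough sketch of this encounter filter and the standoff calculation, assuming flat-earth lateral coordinates in feet and obstacles modeled as cylinders (all names here are hypothetical, not from the released processing code):

    import numpy as np

    def segment_standoff(track_xy_ft, track_agl_ft, track_msl_ft,
                         obstacle_xy_ft, obstacle_radius_ft, obstacle_top_msl_ft):
        # Lateral distance from each track point to the edge of the obstacle cylinder.
        lateral = np.hypot(track_xy_ft[:, 0] - obstacle_xy_ft[0],
                           track_xy_ft[:, 1] - obstacle_xy_ft[1]) - obstacle_radius_ft
        # Vertical separation based on track MSL altitude and the obstacle's top (MSL).
        vertical = track_msl_ft - obstacle_top_msl_ft
        # Encounter criterion from the description: at or below 3069 ft AGL and
        # within 3000 ft laterally at some point in the 60 s segment.
        is_encounter = bool(np.any((track_agl_ft <= 3069) & (lateral <= 3000)))
        return is_encounter, lateral, vertical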

    For each combination of aircraft track and obstacle datasets, the results were organized seven different ways. Filtering criteria were based on aircraft type and distance away from runways. Runway data was sourced from the FAA runways of the United States, Puerto Rico, and Virgin Islands open dataset. Aircraft type was identified as part of the em-processing-opensky workflow.

    All: No filter, all observations that satisfied encounter conditions

    nearRunway: Aircraft within or at 2 nautical miles of a runway

    awayRunway: Observations more than 2 nautical miles from a runway

    glider: Observations when aircraft type is a glider

    fwme: Observations when aircraft type is a fixed-wing multi-engine

    fwse: Observations when aircraft type is a fixed-wing single engine

    rotorcraft: Observations when aircraft type is a rotorcraft

    License

    This dataset is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International(CC BY-NC-ND 4.0).

    This license requires that reusers give credit to the creator. It allows reusers to copy and distribute the material in any medium or format in unadapted form and for noncommercial purposes only. Only noncommercial use of your work is permitted. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Exceptions are given for the not for profit standards organizations of ASTM International and RTCA.

    MIT is releasing this dataset in good faith to promote open and transparent research of the low altitude airspace. Given the limitations of the dataset and a need for more research, a more restrictive license was warranted. Namely, it is based only on observations of ADS-B equipped aircraft, which not all aircraft in the airspace are required to employ, and the observations were sourced from a crowdsourced network whose surveillance coverage has not been robustly characterized.

    As more research is conducted and the low altitude airspace is further characterized or regulated, it is expected that a future version of this dataset may have a more permissive license.

    Distribution Statement

    DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

    © 2021 Massachusetts Institute of Technology.

    Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

    This material is based upon work supported by the Federal Aviation Administration under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Federal Aviation Administration.

    This document is derived from work done for the FAA (and possibly others); it is not the direct product of work done for the FAA. The information provided herein may include content supplied by third parties. Although the data and information contained herein has been produced or processed from sources believed to be reliable, the Federal Aviation Administration makes no warranty, expressed or implied, regarding the accuracy, adequacy, completeness, legality, reliability or usefulness of any information, conclusions or recommendations provided herein. Distribution of the information contained herein does not constitute an endorsement or warranty of the data or information provided herein by the Federal Aviation Administration or the U.S. Department of Transportation. Neither the Federal Aviation Administration nor the U.S. Department of Transportation shall be held liable for any improper or incorrect use of the information contained herein and assumes no responsibility for anyone’s use of the information. The Federal Aviation Administration and U.S. Department of Transportation shall not be liable for any claim for any loss, harm, or other damages arising from access to or use of data or information, including without limitation any direct, indirect, incidental, exemplary, special or consequential damages, even if advised of the possibility of such damages. The Federal Aviation Administration shall not be liable to anyone for any decision made or action taken, or not taken, in reliance on the information contained

  13. XMM-Newton OM Object Catalog - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). XMM-Newton OM Object Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/xmm-newton-om-object-catalog
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Optical Monitor Catalog of serendipitous sources (OMCat) contains entries for every source detected in the publicly available XMM-Newton Optical Monitor (OM) images taken in the imaging mode. Since the OM records data simultaneously with the X-ray telescopes on XMM-Newton, it typically produces images in one or more near-UV/optical bands for every pointing of the observatory. As of the beginning of 2014, the data in the public archive covered roughly 0.5% of the sky in 3425 fields. The OMCat is not dominated by sources previously undetected at other wavelengths; the bulk of objects have optical counterparts. However, the OMCat can be used to extend optical or X-ray spectral energy distributions for known objects into the ultraviolet, to study at higher angular resolution objects detected with GALEX, or to find high-Galactic-latitude objects of interest for UV spectroscopy.

    Differences between the current OMCat and the previous version of the OMCat (which was designated as XMMOMOBJ) are improved coordinates, improved quality flags, and a reduced number of spurious sources. The OM reduction was done with the standard ESAS software, with post-processing to apply the coordinate corrections in a more consistent manner.

    There is a major change in the way the data are represented in the table. In the previous XMMOMOBJ table a separate row was generated for each filter. In the current XMMOMCAT table each observation of each object generates only a single row regardless of how many filters were used. Unused filters have nulls while filters where the object is not detected have nulls for the detection parameters but a non-zero value for exposure. The table includes information for each filter and averaged information for the object as a whole. Only filters in which the object was detected are used in the averages.

    The parameters in this table comprise two sets: parameters describing the detection overall including id's and mean values, and values specific to the individual bands. There are three possible situations for the band data: (1) If there was no exposure in that band, then all fields for that band will be null. (2) If there was some exposure in the band but the object was not detected in that band, then the exposure field will give the actual exposure, but all of the other fields for that band will be null. (3) If the object was detected, then all of the fields for the band should be filled in. The filters included are V, B, U, UVW1, UVM2, UVW2 and white (i.e., unfiltered).

    The original table (formerly known as XMMOMOBJ) was created by the HEASARC in March 2008, based on a table supplied by the authors. The XMMOMCAT version was generated and ingested in February 2014 using a program which concatenated the objects detected in processing each observation. This is a service provided by NASA HEASARC.
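
    A small sketch of checking the per-band convention above when reading the table (column names such as v_exposure and v_mag are hypothetical; the actual OMCat column names should be taken from the catalog documentation):

    import pandas as pd

    df = pd.read_csv("omcat.csv")  # assumed CSV export of the catalog

    def band_status(row, band="v"):
        # Classify one band of one source per the three situations described above.
        exposure = row.get(f"{band}_exposure")
        magnitude = row.get(f"{band}_mag")
        if pd.isna(exposure):
            return "no exposure in this band"
        if pd.isna(magnitude):
            return "exposed but not detected"
        return "detected"

    df["v_status"] = df.apply(band_status, axis=1)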

  14. Good Growth Plan 2014-2019 - Japan

    • microdata.worldbank.org
    • datacatalog.ihsn.org
    • +1more
    Updated Jan 27, 2023
    + more versions
    Cite
    Syngenta (2023). Good Growth Plan 2014-2019 - Japan [Dataset]. https://microdata.worldbank.org/index.php/catalog/5634
    Explore at:
    Dataset updated
    Jan 27, 2023
    Dataset authored and provided by
    Syngenta
    Time period covered
    2014 - 2019
    Area covered
    Japan
    Description

    Abstract

    Syngenta is committed to increasing crop productivity and to using limited resources such as land, water and inputs more efficiently. Since 2014, Syngenta has been measuring trends in agricultural input efficiency on a global network of real farms. The Good Growth Plan dataset shows aggregated productivity and resource efficiency indicators by harvest year. The data has been collected from more than 4,000 farms and covers more than 20 different crops in 46 countries. The data (except USA data and for Barley in UK, Germany, Poland, Czech Republic, France and Spain) was collected, consolidated and reported by Kynetec (previously Market Probe), an independent market research agency. It can be used as benchmarks for crop yield and input efficiency.

    Geographic coverage

    National coverage

    Analysis unit

    Agricultural holdings

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A. Sample design: Farms are grouped in clusters, which represent a crop grown in an area with homogeneous agro-ecological conditions and include comparable types of farms. The sample includes reference and benchmark farms. The reference farms were selected by Syngenta and the benchmark farms were randomly selected by Kynetec within the same cluster.

    B. Sample size: Sample sizes for each cluster are determined with the aim of measuring statistically significant increases in crop efficiency over time. This is done by Kynetec based on target productivity increases and assumptions regarding the variability of farm metrics in each cluster. The smaller the expected increase, the larger the sample size needed to measure significant differences over time. Variability within clusters is assumed based on public research and expert opinion. In addition, growers are also grouped in clusters as a means of keeping variances under control, as well as distinguishing between growers in terms of crop size, region and technological level. A minimum sample size of 20 interviews per cluster is needed. The minimum number of reference farms is 5 of 20. The optimal number of reference farms is 10 of 20 (balanced sample).
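    To make the trade-off concrete, the sketch below shows the kind of power calculation this description implies: fewer farms are needed per cluster when the expected efficiency increase is large relative to the assumed variability. The two-sample z-approximation and the example numbers are assumptions for illustration, not Kynetec's actual methodology.

```python
# Illustrative sample-size sketch: smaller expected increases need larger samples.
from math import ceil
from statistics import NormalDist

def farms_per_cluster(expected_increase, coeff_variation,
                      alpha=0.05, power=0.80, minimum=20):
    """Farms needed per cluster to detect a relative increase in mean efficiency."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Two-sample comparison of means with equal variance, in relative units
    n = 2 * ((z_alpha + z_beta) * coeff_variation / expected_increase) ** 2
    return max(minimum, ceil(n))

# A 10% expected increase with 25% variability needs far more farms than a 30% increase:
print(farms_per_cluster(0.10, 0.25))  # ~98 farms
print(farms_per_cluster(0.30, 0.25))  # floored at the minimum of 20
```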

    C. Selection procedure: The respondents were picked randomly using a “quota-based random sampling” procedure. Growers were first randomly selected and then checked for compliance with the quotas for crops, region, farm size, etc. To avoid clustering a high number of interviews at one sampling point, interviewers were instructed to do a maximum of 5 interviews in one village.

    Benchmark farms (BF) screened in Japan were selected based on the following criteria:

    - Location: Hokkaido Tokachi (JA Memuro, JA Otofuke, JA Tokachi Shimizu, JA Obihiro Taisho); the initial focus was on Memuro, Otofuke, Tokachi Shimizu and Obihiro Taisho. Locations added in GGP 2015 due to a change of reference farms: Obihiro, Kamikawa, Abashiri.
    - BF: no use of in-furrow application (Amigo) and no use of Amistar.
    - Contract farmers of snack and other food companies. Screening question: "Do you have quality contracts in place with snack and food companies for your potato production?" (Y/N; if no, screen out.)
    - Interest in increasing marketable yield. Screening question: "Are you interested in growing branded potatoes (premium potatoes for the processing industry)?" (Y/N; if no, screen out.)
    - Potato growers producing for processing use.

    Background info (no mention of Syngenta): labor cost is a very serious issue; in general, labor cost in Japan is very high, and growers try to reduce it through mechanization. Growers would like to manage the share of labor cost in total production cost, and are quality and yield driven.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Data collection tool for 2019 covered the following information:

    (A) PRE-HARVEST INFORMATION

    PART I: Screening
    PART II: Contact Information
    PART III: Farm Characteristics
      a. Biodiversity conservation
      b. Soil conservation
      c. Soil erosion
      d. Description of growing area
      e. Training on crop cultivation and safety measures
    PART IV: Farming Practices - Before Harvest
      a. Planting and fruit development - Field crops
      b. Planting and fruit development - Tree crops
      c. Planting and fruit development - Sugarcane
      d. Planting and fruit development - Cauliflower
      e. Seed treatment

    (B) HARVEST INFORMATION

    PART V: Farming Practices - After Harvest
      a. Fertilizer usage
      b. Crop protection products
      c. Harvest timing & quality per crop - Field crops
      d. Harvest timing & quality per crop - Tree crops
      e. Harvest timing & quality per crop - Sugarcane
      f. Harvest timing & quality per crop - Banana
      g. After harvest
    PART VI: Other inputs - After Harvest
      a. Input costs
      b. Abiotic stress
      c. Irrigation

    See all questionnaires in external materials tab

    Cleaning operations

    Data processing:

    Kynetec uses SPSS (Statistical Package for the Social Sciences) for data entry, cleaning, analysis, and reporting. After collection, the farm data is entered into a local database, reviewed, and quality-checked by the local Kynetec agency. In the case of missing values or inconsistencies, farmers are re-contacted. In some cases, grower data is verified with local experts (e.g. retailers) to ensure data accuracy and validity. After country-level cleaning, the farm-level data is submitted to the global Kynetec headquarters for processing. In the case of missing values or inconsistencies, the local Kynetec office was re-contacted to clarify and solve issues.

    Quality assurance Various consistency checks and internal controls are implemented throughout the entire data collection and reporting process in order to ensure unbiased, high quality data.

    • Screening: Each grower is screened and selected by Kynetec based on cluster-specific criteria to ensure a comparable group of growers within each cluster. This helps keep variability low.

    • Evaluation of the questionnaire: The questionnaire aligns with the global objective of the project and is adapted to the local context (e.g. interviewers and growers should understand what is asked). Each year the questionnaire is evaluated based on several criteria, and updated where needed.

    • Briefing of interviewers: Each year, local interviewers - familiar with the local context of farming - are thoroughly briefed to fully comprehend the questionnaire and to obtain unbiased, accurate answers from respondents.

    • Cross-validation of the answers (a minimal sketch of such a check follows after this list):
      o Kynetec captures all growers' responses through a digital data-entry tool. Various logical and consistency checks are automated in this tool (e.g. total crop size in hectares cannot be larger than farm size).
      o Kynetec cross-validates the answers of the growers in three different ways: (1) within the grower (checking that the grower responds consistently during the interview); (2) across years (checking that growers respond consistently throughout the years); (3) within the cluster (comparing a grower's responses with those of others in the group).
      o All of the above-mentioned inconsistencies are followed up by contacting the growers and asking them to verify their answers. The data is updated after verification. All updates are tracked.

    • Check and discuss evolutions and patterns: Global evolutions are calculated, discussed and reviewed on a monthly basis jointly by Kynetec and Syngenta.

    • Sensitivity analysis: sensitivity analysis is conducted to evaluate the global results in terms of outliers, retention rates and overall statistical robustness. The results of the sensitivity analysis are discussed jointly by Kynetec and Syngenta.

    • It is recommended that users interested in using the administrative level 1 variable in the location dataset use this variable with care and crosscheck it with the postal code variable.
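    As a concrete illustration of the automated logical checks mentioned in the cross-validation bullet above, the sketch below flags a record whose total crop area exceeds the reported farm size. The field names and the second check are assumptions for illustration, not the actual Kynetec tool.

```python
# Minimal sketch of an automated logical consistency check on one grower record.
def check_grower_record(record):
    """Return a list of consistency flags for one grower interview."""
    flags = []
    total_crop_ha = sum(record.get("crop_areas_ha", []))
    if total_crop_ha > record.get("farm_size_ha", float("inf")):
        flags.append("total crop area exceeds farm size")
    if record.get("harvested_ha", 0) > total_crop_ha:
        flags.append("harvested area exceeds planted area")
    return flags

print(check_grower_record(
    {"farm_size_ha": 40, "crop_areas_ha": [25, 20], "harvested_ha": 30}
))  # ['total crop area exceeds farm size']
```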

    Data appraisal

    Due to the above mentioned checks, irregularities in fertilizer usage data were discovered which had to be corrected:

    For data collection wave 2014, respondents were asked to give a total estimate of the fertilizer NPK-rates that were applied in the fields. From 2015 onwards, the questionnaire was redesigned to be more precise and obtain data by individual fertilizer products. The new method of measuring fertilizer inputs leads to more accurate results, but also makes a year-on-year comparison difficult. After evaluating several solutions to this problem, 2014 fertilizer usage (NPK input) was re-estimated by calculating a weighted average of fertilizer usage in the following years.
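    The re-estimation described above amounts to a weighted average over later waves. The sketch below illustrates the arithmetic only; the weights and rates are invented for illustration and are not the values used in the actual correction.

```python
# Sketch: replace the 2014 NPK estimate with a weighted average of later waves.
def reestimate_2014(later_rates, weights=None):
    """Weighted average of later-year NPK rates (kg/ha) used as the 2014 estimate."""
    weights = weights or [1.0] * len(later_rates)
    return sum(r * w for r, w in zip(later_rates, weights)) / sum(weights)

# e.g. rates observed in 2015-2017, with more weight on the adjacent years
print(round(reestimate_2014([180.0, 172.0, 165.0], weights=[3, 2, 1]), 1))  # 174.8
```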

  15. MMS 4 Electron Drift Instrument (EDI) Quality Zero Counts, Level 2 (L2),...

    • catalog.data.gov
    • heliophysicsdata.gsfc.nasa.gov
    Updated Sep 19, 2025
    + more versions
    Cite
    MMS Science Data Center;NASA Space Physics Data Facility (SPDF) Coordinated Data Analysis Web (CDAWeb) Data Services (2025). MMS 4 Electron Drift Instrument (EDI) Quality Zero Counts, Level 2 (L2), Survey Mode, 0.125 s Data [Dataset]. https://catalog.data.gov/dataset/mms-4-electron-drift-instrument-edi-quality-zero-counts-level-2-l2-survey-mode-0-125-s-dat
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Electron Drift Instrument (EDI) Q0 Survey, Level 2, 0.125 s Data (8 samples/s). EDI has two scientific data acquisition modes, called electric field mode and ambient mode. In electric field mode, two coded electron beams are emitted such that they return to the detectors after one or more gyrations in the ambient magnetic and electric field. The firing directions and times-of-flight allow the derivation of the drift velocity and electric field. In ambient mode, the electron beams are not used. The detectors, with their large geometric factors and their ability to adjust the field of view quickly, allow continuous sampling of ambient electrons at a selected pitch angle and fixed but selectable energy.

    To find the beam directions that will hit the detector, EDI sweeps each beam in the plane perpendicular to B at a fixed angular rate of 0.22 °/ms until a signal has been acquired by the detector. Once a signal has been acquired, the beams are swept back and forth to stay on target. Beam detection is not determined from the changes in the count-rates directly, but from the square of the beam counts divided by the background counts from ambient electrons, i.e., from the square of the instantaneous signal-to-noise ratio (SNR). This quantity is computed from data provided by the correlator in the Gun-Detector Electronics that also generates the coding pattern imposed on the outgoing beams. If the squared SNR exceeds a threshold, this is taken as evidence that the beam is returning to the detector. The thresholds for SNR are chosen dependent on background fluxes. They represent a compromise between getting false hits (induced by strong variations in background electron fluxes) and missing true beam hits.

    The basic software loop that controls EDI operations is executed every 2 ms. As the times when the beams hit their detectors are neither synchronized with the telemetry nor equidistant, EDI data have no fixed time-resolution. Data are reported in telemetry slots. In Survey, using the standard packing mode 0, there are eight telemetry slots per second and Gun-Detector Unit (GDU). The last beam detected during the previous slot will be reported in the current slot. If no beam has been detected, the data quality will be set to zero. In Burst telemetry there are 128 slots per second and GDU. The data in each slot consist of information regarding the beam firing directions (stored in the form of analytic gun deflection voltages), times-of-flight (if successfully measured), quality indicators, time stamps of the beam hits, and some auxiliary correlator-related information.

    Whenever EDI is not in electron drift mode, it uses its ambient electron mode. This mode has the capability to sample at either 90 degrees pitch angle or at 0/180 degrees (field aligned), or to alternate between 90 degrees and field aligned with selectable dwell times. While all options have been demonstrated during the commissioning phase, only the field-aligned mode has been used in the routine operations phase. The choices for energy are 250 eV, 500 eV, and 1 keV. The two detectors, which are facing opposite hemispheres, are looking strictly into opposite directions, so while one detector is looking along B the other is looking antiparallel to B (corresponding to pitch angles of 180 and 0 degrees, respectively). The two detectors switch roles every half spin of the spacecraft as the tip of the magnetic field vector spins outside the field of view of one detector and into the field of view of the other detector.

    These data are a by-product generated from data collected in electric field mode. Whenever no return beam is found in a particular time slot, the data reported by the flight software for that slot are flagged with the lowest quality level (quality zero). The ground processing generates a separate data product from these counts data. The EDI instrument paper can be found at: http://link.springer.com/article/10.1007%2Fs11214-015-0182-7. The EDI instrument data products guide can be found at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.
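    The detection statistic described above (squared beam counts divided by background counts, i.e. the squared instantaneous SNR) can be written compactly. In the sketch below the threshold value is an illustrative assumption, since the real thresholds depend on the background fluxes.

```python
# Sketch of the squared-SNR beam-detection criterion described above.
def beam_detected(beam_counts, background_counts, threshold=10.0):
    """Return True if beam_counts**2 / background_counts exceeds the threshold."""
    if background_counts <= 0:
        return beam_counts > 0
    snr_squared = beam_counts ** 2 / background_counts
    return snr_squared > threshold

print(beam_detected(beam_counts=60, background_counts=200))  # 18.0 > 10 -> True
print(beam_detected(beam_counts=20, background_counts=200))  # 2.0 <= 10 -> False
```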

  16. MMS 1 Electron Drift Instrument (EDI) Quality Zero Counts, Level 2 (L2),...

    • catalog.data.gov
    • heliophysicsdata.gsfc.nasa.gov
    Updated Sep 19, 2025
    + more versions
    Cite
    MMS Science Data Center;NASA Space Physics Data Facility (SPDF) Coordinated Data Analysis Web (CDAWeb) Data Services (2025). MMS 1 Electron Drift Instrument (EDI) Quality Zero Counts, Level 2 (L2), Survey Mode, 0.125 s Data [Dataset]. https://catalog.data.gov/dataset/mms-1-electron-drift-instrument-edi-quality-zero-counts-level-2-l2-survey-mode-0-125-s-dat
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Electron Drift Instrument (EDI) Q0 Survey, Level 2, 0.125 s Data (8 samples/s). EDI has two scientific data acquisition modes, called electric field mode and ambient mode. In electric field mode, two coded electron beams are emitted such that they return to the detectors after one or more gyrations in the ambient magnetic and electric field. The firing directions and times-of-flight allow the derivation of the drift velocity and electric field. In ambient mode, the electron beams are not used. The detectors, with their large geometric factors and their ability to adjust the field of view quickly, allow continuous sampling of ambient electrons at a selected pitch angle and fixed but selectable energy.

    To find the beam directions that will hit the detector, EDI sweeps each beam in the plane perpendicular to B at a fixed angular rate of 0.22 °/ms until a signal has been acquired by the detector. Once a signal has been acquired, the beams are swept back and forth to stay on target. Beam detection is not determined from the changes in the count-rates directly, but from the square of the beam counts divided by the background counts from ambient electrons, i.e., from the square of the instantaneous signal-to-noise ratio (SNR). This quantity is computed from data provided by the correlator in the Gun-Detector Electronics that also generates the coding pattern imposed on the outgoing beams. If the squared SNR exceeds a threshold, this is taken as evidence that the beam is returning to the detector. The thresholds for SNR are chosen dependent on background fluxes. They represent a compromise between getting false hits (induced by strong variations in background electron fluxes) and missing true beam hits.

    The basic software loop that controls EDI operations is executed every 2 ms. As the times when the beams hit their detectors are neither synchronized with the telemetry nor equidistant, EDI data have no fixed time-resolution. Data are reported in telemetry slots. In Survey, using the standard packing mode 0, there are eight telemetry slots per second and Gun-Detector Unit (GDU). The last beam detected during the previous slot will be reported in the current slot. If no beam has been detected, the data quality will be set to zero. In Burst telemetry there are 128 slots per second and GDU. The data in each slot consist of information regarding the beam firing directions (stored in the form of analytic gun deflection voltages), times-of-flight (if successfully measured), quality indicators, time stamps of the beam hits, and some auxiliary correlator-related information.

    Whenever EDI is not in electron drift mode, it uses its ambient electron mode. This mode has the capability to sample at either 90 degrees pitch angle or at 0/180 degrees (field aligned), or to alternate between 90 degrees and field aligned with selectable dwell times. While all options have been demonstrated during the commissioning phase, only the field-aligned mode has been used in the routine operations phase. The choices for energy are 250 eV, 500 eV, and 1 keV. The two detectors, which are facing opposite hemispheres, are looking strictly into opposite directions, so while one detector is looking along B the other is looking antiparallel to B (corresponding to pitch angles of 180 and 0 degrees, respectively). The two detectors switch roles every half spin of the spacecraft as the tip of the magnetic field vector spins outside the field of view of one detector and into the field of view of the other detector.

    These data are a by-product generated from data collected in electric field mode. Whenever no return beam is found in a particular time slot, the data reported by the flight software for that slot are flagged with the lowest quality level (quality zero). The ground processing generates a separate data product from these counts data. The EDI instrument paper can be found at: http://link.springer.com/article/10.1007%2Fs11214-015-0182-7. The EDI instrument data products guide can be found at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.

  17. National Risk and Vulnerability Assessment 2005 - Afghanistan

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    Central Statistics Office (CSO) (2019). National Risk and Vulnerability Assessment 2005 - Afghanistan [Dataset]. https://catalog.ihsn.org/index.php/catalog/934
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Central Statistics Office (CSO)
    Time period covered
    2005
    Area covered
    Afghanistan
    Description

    Abstract

    The primary objective of NRVA 2005 is to collect information at community and household level to better understand livelihoods of Kuchi (nomadic pastoralists), rural and urban households throughout the country, and to determine the types of risks and vulnerabilities they face. National and international stakeholders can benefit from the summarized findings of the report or the data set made available for in-depth analysis to develop strategies to address the short, medium, and long-term needs of the nomadic, rural and urban populations through better informed and timely policy development and intervention strategies.

    The 2005 Assessment takes into account a series of recommendations made by several stakeholders during a workshop held in June 2004 when the preliminary NRVA 2003 results were discussed. The assessment includes urban households allowing a more comprehensive appreciation of the status of the country in the summer of 2005.

    Geographic coverage

    The survey covered 34 provinces excluding 6 districts.

    Analysis unit

    Community (Shura), Households, and Individuals

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A sample of 30,822 households from 34 provinces (1,735 Kuchi, 23,220 rural and 5,867 urban) was drawn, excluding 6 districts that were not enumerated (where CSO household listing data was not available at the time of sampling, the Livestock Census [FAO, 2003] data was used instead). Twelve districts (all 11 districts of Zabul, and Maruf district in Kandahar) were enumerated only by male surveyors due to security restrictions; however, in these districts the food consumption part of the female questionnaire was filled out by male enumerators interviewing male respondents.

    Rural and Urban Settled Households

    The analytical domain, the unit at which the data are statistically representative, is at the level of 34 rural provinces; in contrast to NRVA 2003, the province of Uruzgan was split into smaller Uruzgan and Daykundi; the same happened to Parwan, which was split into Parwan and Panjsher. In addition to these 34 provincial analytical domains, there are 10 urban areas with populations larger than 10,000 households.

    The survey has also collected data representative of these 10 urban domains. Thus, there are 44 settled analytical domains. Because Kuchi have been considered as one national analytical domain, there are a total of 45 analytical domains for NRVA 2005. Collecting representative data with a proportional sample at the provincial level creates a challenge because of the large variation in provincial population, from the smallest in the province of Nimroz, with only 13,941 rural households, to Hirat, with 226,650 rural households. To adjust the sampling to the available budget, the province of Jawzjan, with 50,900 rural households, has been used as the base analytical domain for which the sampling fraction has been determined. For those domains with smaller populations than Jawzjan, and where the sample fraction delivered fewer than 350 households, further clusters were added to ensure a minimum sample size of 350 households. The sample is therefore not self-weighting.

    For those provinces or districts within provinces where the sample frame was not yet available at the time of sampling (42 districts), the Livestock Census database was used to draw a sample. On arrival at a village, the number of households was determined during the male community interview. As it was difficult for the enumerators to predict the number of households within dwellings, an additional question was asked for the total number of dwellings in the village. This number was divided by 12 to create a sampling interval for households within the community. The enumerators then selected a household each time they had counted this interval of dwellings. By using this method, the sampled households were randomly selected and spread evenly throughout the village.
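    The within-village selection described above is a systematic sample with an interval of roughly (number of dwellings) / 12. The sketch below illustrates that logic; the random starting point is an assumption of the sketch, since the report does not state how enumerators chose the first dwelling.

```python
# Sketch of systematic within-village selection with a dwellings/12 interval.
import random

def select_dwellings(total_dwellings, target=12, seed=0):
    interval = max(1, total_dwellings // target)        # sampling interval
    start = random.Random(seed).randrange(interval)     # assumed random start
    return list(range(start, total_dwellings, interval))[:target]

print(select_dwellings(150))  # 12 dwelling indices spread through the village
```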

    Kuchi households

    The household listing conducted by CSO did not effectively include the migratory Kuchi population up to the date of the survey; hence there was no effective sampling frame for this population. This lack of enumeration of the Kuchi population apparently also includes those that have recently settled. This is exactly the same population that was surveyed during winter/spring 2004 by the National Multi-Sectoral Assessment for Kuchi (NMAK), i.e. the Kuchi who are still nomadic and those who have recently settled since the onset of the last drought period. This is the best estimate of the current Kuchi population. The unit of observation for the survey was the Kuchi communities in their winter location, where one or more Kuchi communities may have been located. The sample frame for the survey was created by constructing the predicted Kuchi populations in their summer location, for which information was collected from the NMAK 2004 survey.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The core of NRVA 2005 is formed by the household questionnaire. The household questionnaire consisted of the following 18 sections; the first 14 were answered by the male head of household or a male respondent, and the last four by female members of the household:
    - Household register and education
    - Housing
    - Household facilities
    - Drinking water
    - Assets and credit
    - Livestock
    - Agriculture and land tenure
    - Migration, remittance and social networks
    - Sources of income
    - Household expenditures
    - Cash for work
    - Food Aid and iodized salt
    - Household shocks and coping strategies
    - HIV/AIDS
    - Food consumption
    - Maternal child health
    - Children 0 - 59 months
    - HIV/AIDS and literacy test

    The total number of questions that were asked to the sampled households exceeded 260 but not all questions were answered because some of them were eliminated based on the responses provided (with skipping rules). The household is regarded as the unit of analysis. In Afghanistan there is a need to address the questions to males and females depending on their nature. In every sampled community 12 households have been interviewed. On average the time required to answer the household questionnaire was less than two hours. Besides the household questionnaire, information was gathered at community level. Therefore, two community questionnaires were designed – one male and one female. These two questionnaires addressed the following topics:

    Male shura questionnaire:
    - Community information
    - Access to infrastructure
    - Markets access
    - Health access
    - Education
    - Community roles and governance
    - Programme activities
    - Community priorities
    - Water table

    Female shura questionnaire:
    - Health access
    - Community bodies and governance
    - Community priorities

    Cleaning operations

    Automated data entry

    Teleform Enterprise version 8 (Cardiff software, donated by WFP) was used throughout the process to scan the NRVA 2005 Teleform questionnaires filled in the field. Teleform is an electronic, pre-programmed method of gathering data (optically readable software), often used for its speed and accuracy in large surveys and censuses. A scanner capable of processing 60 sheets per minute was used. This is unlike NRVA 2003, where Teleform was used only for the shura and wealth group data, after these had been transcribed by VAM and key enumerator staff into scannable formats; the information was finally scanned into a Microsoft Access database using Teleform.

    The NRVA 2005 was completely designed in Teleform; then the enumerators filled in the pre-designed questionnaire sheets and the data were directly scanned into the Access database. Scanning 1.3 million data sheets took two to three months more than anticipated; the process was finally finished in February 2006. These delays were partially due to the quality of enumeration of questionnaires, computer hardware that was not powerful enough to sustain the processing required (alleviated by the loan of a high-speed server from UNOPS) and the absence of a stable electricity supply (alleviated by the loan of the power generator from WFP).

    Once the data were scanned, the programme logically checked that the number of responses per question was not exceeded. Unfortunately, within NRVA 2005 a decision was taken to insert the number of the response within the answer circles. This resulted in some false positive answers, as a high percentage of the answer circles were already coloured. Only when a true answer was also indicated (giving two responses) did the programme stop and ask for verification; if there was no other response, the false positive was accepted and these responses were taken out during normal cleaning practices. Once a questionnaire was validated, the image file was deleted and the data were written to the Access database. Descriptive statistics were estimated with SPSS and Genstat. Cluster analysis using ADATTI software was used for food security profiling. Provincial statistics produced are included in the Annex; those for national, Kuchi, rural and urban categories are included in the main body of the document.

    Data appraisal

    Data constraints and limitations

    In spite of the time spent on the design of the questionnaire and its implementation in NRVA 2005, the data gathered have the following limitations:
    - Seasonality. Food security assessment and household perceptions are only valid for the summer season, rather than for the whole year.
    - Limited data on non-food consumption. Due to the multilateral nature of the assessment, most non-food consumption items (except communication costs) have been included as groups, to avoid an exhaustive questionnaire with a strong risk of lowering the quality of data.
    - Income.

  18. Semi-leptonic ttbar full-event unfolding R&D dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Aug 27, 2024
    Cite
    Kevin Greif; Michael Fenton (2024). Semi-leptonic ttbar full-event unfolding R&D dataset [Dataset]. http://doi.org/10.5281/zenodo.13364827
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin Greif; Michael Fenton
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was generated for the purpose of developing unfolding methods that leverage generative machine learning models. It consists of two pieces: one piece contains events with the Standard Model (SM) production of a top-quark pair in the semi-leptonic decay mode, and the other contains events with top-quark pair production modified by a non-zero EFT operator. The SM dataset contains 15,015,000 events, and the EFT dataset contains 30,000,000. Both datasets store the following event configurations:

    • Parton level: configurations of all partons that result from the matrix element calculation done using MadGraph
    • Particle level: configurations of all “truth” jets and leptons that result from the parton shower and hadronization modelled using Pythia
    • Detector level: configurations of all reconstruction level jets and leptons, measured by a detector simulated with Delphes and the default CMS detector card.

    Each of these configurations is stored in a dedicated group as described below. Throughout, the units of energy and transverse momentum are GeV. For more details on the generation of this dataset, see Ref. [1].

    Parton level data:

    • No phase space requirements are placed on the events at parton level.
    • The kinematics of the top, anti-top, W+, W-, and all decay products are contained in groups entitled top, antitop, Wp, and Wm respectively. Each of these contains the kinematics of the parton itself in a group called “particle”, as well as the kinematics of two daughter particles, in groups called “d1” and “d2”. In the case of the tops, these daughters are the W’s and b quarks. In the case of the W’s, these are two light quarks, or a lepton and a neutrino. The “pid” vector contains the PDGID for a given particle, used to identify its type.
    • One detail is that the W’s “particle” description is not always the same as the description of the same W stored as the daughter of the tops. This occurs when the W radiates a parton before decaying.

    Particle level data:

    • At particle level all leptons and jets are required to have $p_T > 25$ GeV and absolute pseudorapidity $|\eta| < 2.5$.
    • Events at particle level are required to have at least one electron or muon and at least 4 jets, of which at least two are b-tagged. Events which pass or fail these criteria are marked by the vector contained in the group “mask”.
    • Electrons and muons are stored in separate groups. Each group contains a vector “mask” which is true only if there is a true particle-level electron or muon in the event, and false if this entry is zero padding.
    • Jets are clustered from stable particle level objects using the anti-kt algorithm with a radius parameter of 0.5. Jet information is stored in the group “jets”, and true jets in the event are again denoted by a true value in the vector “mask”, and zero-padding is marked by a false value. Jets additionally contain a vector “btag” which is 1 if the jet is b-tagged with the default Delphes prescription, and 0 if not.
    • Information on the missing transverse momentum (MET) is contained in the group “met”. The “met” vector gives the magnitude, and the “phi” vector gives the direction in phi of the missing transverse momentum.
    • In addition to the information on the jets, leptons, and MET, the particle level data also contain the configurations for the hadronic top, leptonic top, and ttbar system. These configurations are determined assuming the pseudo-top jet parton assignment algorithm, which is a common method used by LHC experiments when analyzing semileptonic ttbar events.

    Detector level data:

    • Requirements for leptons and jets are the same as for the particle level data.
    • The event selection is the same as the particle level data. Events which pass the selection are again denoted by a true value in the vector “mask”.
    • The data for the leptons, jets, and MET are stored analogously to particle level
    • The configurations of the top quarks and ttbar system are not pre-computed at detector level, since ideally a generative unfolding method would not assume a given jet-parton assignment algorithm when it is being trained. However, if the user wishes to pursue such an application, the relevant configurations can be obtained by running the pseudo-top algorithm [2]. A minimal sketch of reading the event groups described above is given below.
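    Reading sketch for the layout described above. It assumes the files are HDF5 with the listed group and vector names; the file name, the "pt" field, and the exact dataset paths are illustrative assumptions rather than the published schema.

```python
# Hypothetical sketch of reading the particle-level groups with masks and zero padding.
import h5py
import numpy as np

with h5py.File("ttbar_particle_level.h5", "r") as f:
    passed = f["mask"][:].astype(bool)            # events passing the selection
    jet_mask = f["jets"]["mask"][:].astype(bool)  # real jets vs zero padding
    btag = f["jets"]["btag"][:]                   # 1 if b-tagged, 0 otherwise
    jet_pt = f["jets"]["pt"][:]                   # assumed per-jet pT field, GeV

    # Count b-tagged jets per selected event, ignoring zero-padded slots
    n_btags = np.where(jet_mask, btag, 0).sum(axis=1)[passed]
    print("mean b-tag multiplicity in selected events:", n_btags.mean())
```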

    Citations:

    [1] - https://arxiv.org/abs/2404.14332

    [2] - https://twiki.cern.ch/twiki/bin/view/LHCPhysics/ParticleLevelTopDefinitions

  19. Opal Tap On and Tap Off Release 2

    • gimi9.com
    • opendata.transport.nsw.gov.au
    • +1more
    + more versions
    Cite
    Opal Tap On and Tap Off Release 2 [Dataset]. https://gimi9.com/dataset/au_nsw-2-opal-tap-on-and-tap-off-release-2/
    Explore at:
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides counts of tap ons and tap offs made on the Opal ticketing system during two non-consecutive weeks in 2016. The Opal tap on and tap off dataset contains six CSV files covering two weeks (14 days) of Opal data across the four public transport modes. Privacy is the utmost priority for all Transport for NSW Open Data and there is no information that can identify any individual in the Open Opal Tap On and Tap Off data. This means that any data that is, or can be, linked to an individual’s Opal card has been removed. This dataset is subject to specific terms and conditions.

    There are three CSV files per week, and these provide a privacy-protected count of taps against:
    - Time: binned to 15 minutes, by tap (tap on or tap off), by date and by mode
    - Location: by tap (tap on or tap off), by date and by mode
    - Time with location: binned to 15 minutes, by tap (tap on or tap off), by date and by mode

    The tap on and tap off counts are not linked and individual trips cannot be derived using the data. The two weeks of Opal data are:
    - Monday 21 November 2016 – Sunday 27 November 2016
    - Monday 26 December 2016 – Sunday 1 January 2017

    Release 1 files are also linked below.
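    A minimal sketch of aggregating the time-binned counts described above. The column names (tap, mode, date, count) and the file name are assumptions about the CSV layout, not the published schema.

```python
# Hypothetical aggregation of the 15-minute-binned Opal tap counts.
import pandas as pd

taps = pd.read_csv("opal_time_week1.csv")  # assumed file name and layout
# Daily tap-on totals per mode for the first released week
daily = (taps[taps["tap"] == "on"]
         .groupby(["date", "mode"], as_index=False)["count"].sum())
print(daily.head())
```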

  20. Labour Market Dynamics in South Africa 2022 - South Africa

    • datafirst.uct.ac.za
    Updated Dec 17, 2024
    + more versions
    Cite
    Statistics South Africa (2024). Labour Market Dynamics in South Africa 2022 - South Africa [Dataset]. https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/1011
    Explore at:
    Dataset updated
    Dec 17, 2024
    Dataset authored and provided by
    Statistics South Africa (http://www.statssa.gov.za/)
    Time period covered
    2022
    Area covered
    South Africa
    Description

    Abstract

    The Quarterly Labour Force Survey (QLFS) is a household-based sample survey conducted by Statistics South Africa (StatsSA). It collects data on the labour market activities of individuals aged 15 years or older who live in South Africa. Since 2008, StatsSA has produced an annual dataset based on the QLFS data, "Labour Market Dynamics in South Africa" (LMDSA). The dataset is constructed using data from all four QLFS datasets in the year. It also includes a number of variables (including income) that are not available in any of the QLFS datasets from 2010.

    Geographic coverage

    The survey had national coverage.

    Analysis unit

    Individuals

    Universe

    The QLFS sample covers the non-institutional population except for those in workers' hostels. However, persons living in private dwelling units within institutions are enumerated. For example, within a school compound, one would enumerate the schoolmaster's house and teachers' accommodation because these are private dwellings. Students living in a dormitory on the school compound would, however, be excluded.

    Kind of data

    Sample survey data

    Sampling procedure

    Each year the LMDSA is created by combining the QLFS waves for that year and then including some additional variables. The QLFS master frame for this LMDSA was based on the 2011 population census by Stats SA. The sampling is stratified by province, district, and geographic type (urban, traditional, farm). There are 3324 PSUs drawn each year, using probability proportional to size (PPS) sampling. In the second stage, Dwelling Units (DUs) are systematically selected from PSUs. The 3324 PSUs are split into four groups for the year, and at each quarter the DUs from the given group are replaced by substitute DUs from the same PSU or the next PSU on the list (in the same group). It should be noted that the sampling unit is the dwelling, and the unit of observation is the household. Therefore, if a household moves out of a dwelling after being in the sample for two quarters and a new household moves in, the new household will be enumerated for two more quarters until the DU is rotated out. If no household moves into the sampled dwelling, the dwelling will be classified as vacant (or unoccupied).
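    The first-stage selection described above is probability-proportional-to-size (PPS) sampling of PSUs within strata. The sketch below shows a generic systematic PPS draw; the PSU sizes and the specific method are illustrative assumptions, not Stats SA's actual implementation.

```python
# Generic systematic PPS selection: larger PSUs are more likely to be drawn.
import numpy as np

def pps_systematic(sizes, n_select, seed=0):
    """Select n_select PSU indices with probability proportional to size."""
    sizes = np.asarray(sizes, dtype=float)
    cumulative = np.cumsum(sizes)
    interval = cumulative[-1] / n_select
    start = np.random.default_rng(seed).uniform(0, interval)
    points = start + interval * np.arange(n_select)
    return np.searchsorted(cumulative, points).tolist()

print(pps_systematic([120, 450, 80, 300, 560, 90], n_select=3))
```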

    Mode of data collection

    Computer Assisted Telephone Interview

    Data appraisal

    The statistical release notes that missing values were "generally imputed" for item non-response but provides no detail on how Statistics SA did so.

Cite
The Devastator (2023). Weather and Housing in North America [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-and-housing-in-north-america

Weather and Housing in North America

Exploring the Relationship between Weather and Housing Conditions in 2012

Explore at:
zip(512280 bytes)Available download formats
Dataset updated
Feb 13, 2023
Authors
The Devastator
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Area covered
North America
Description

Weather and Housing in North America

Exploring the Relationship between Weather and Housing Conditions in 2012

By [source]

About this dataset

This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure and visibility it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge when it comes to understanding what factors can dictate the value and comfort level offered by residential areas throughout North America

More Datasets

For more datasets, click here.

Featured Notebooks

  • 🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.

First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.

Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.

Finally, use visualizations to further investigate this relationship between climate and housing attributes in North America in 2012. Graphs allow you visualize trends like seasonal variations or long-term changes over time more easily so they are useful when interpreting large amounts of data quickly while providing larger context beyond what numbers alone can tell us about relationships between different aspects within this dataset

Research Ideas

  • Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
  • Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
  • Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize their profits from rentals or Airbnb listings over time.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: Weather.csv

| Column name | Description |
|:------------------|:-----------------------------------------------|
| Date/Time | Date and time of the observation. (Date/Time) |
| Temp_C | Temperature in Celsius. (Numeric) |
| Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
| Rel Hum_% | Relative humidity in percent. (Numeric) |
| Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
| Visibility_km | Visibilit... |
