Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The events are generated with Pythia 8, with full detector simulation performed by Geant4 using the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed with the Marlin reconstruction framework, interfaced with Key4HEP. Particle candidates in the reconstructed events are built with the PandoraPF algorithm.
In this version of the dataset, no γγ -> hadrons background is included.
This dataset contains e+e- samples with Z->ττ, ZH (H->ττ), and Z->qq events, with approximately 2 million events simulated in each category.
These e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV.
The .root files from the MC simulation chain are then processed by the software found on GitHub in order to create flat ntuples as the final product.
The basis of the ntuples is the particle-flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, a charge, and a particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for ee collisions, with parameters p = -1 and R = 0.4. The minimum pT is set to 0 GeV for both generator-level jets and reconstructed jets. The dataset contains the four-momenta of the jets, along with the PF candidates in the jets with the above-listed properties.
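For orientation, a minimal sketch of this clustering configuration using the scikit-hep fastjet bindings (the toy four-vectors, and the use of fastjet rather than the dataset's own software, are assumptions for illustration):

import awkward as ak
import fastjet

# Toy PF candidates for a single event; the four-vectors are made up.
particles = ak.Array([
    {"px": 1.2, "py": 0.3, "pz": 10.0, "E": 10.1},
    {"px": 1.1, "py": 0.4, "pz": 9.5, "E": 9.6},
    {"px": -5.0, "py": 2.0, "pz": -3.0, "E": 6.2},
])

# Generalized kt for e+e- collisions with p = -1 and R = 0.4, as stated above.
jetdef = fastjet.JetDefinition(fastjet.ee_genkt_algorithm, 0.4, -1.0)
cluster = fastjet.ClusterSequence(particles, jetdef)
print(cluster.inclusive_jets(min_pt=0.0))  # minimum pT of 0 GeV, as in the text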
Additionally, a set of variables describing the tau lifetime is calculated using the software on GitHub. As the tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.
In summary, the features found in the flat ntuples are:
| Name | Description |
| --- | --- |
| reco_cand_p4s | 4-momenta per particle in the reco jet. |
| reco_cand_charge | Charge per particle in the jet. |
| reco_cand_pdg | PDG id per particle in the jet. |
| reco_jet_p4s | RecoJet 4-momenta. |
| reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles, as no track parameters can be calculated. |
| reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles, as no track parameters can be calculated. |
| reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles, as no track parameters can be calculated. |
| reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles, as no track parameters can be calculated. |
| gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
| gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no genTau can be matched to the GenJet within dR < 0.4, a fill value is used. |
| gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no genTau can be matched to the GenJet within dR < 0.4, a fill value is used. |
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as to reconstructed jets. In order for a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction.
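As an aside, such cone matching might look like the following sketch (illustrative only; the dataset's own matching is implemented in the GitHub software referenced above, and the η/φ-based ΔR definition here is an assumption):

import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    # dR = sqrt(d_eta^2 + d_phi^2), with d_phi wrapped into [-pi, pi]
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

def match(jet_eta, jet_phi, cand_etas, cand_phis, max_dr):
    # Index of the closest candidate within max_dr, or -1 (fill value) if none.
    dr = delta_r(jet_eta, jet_phi, np.asarray(cand_etas), np.asarray(cand_phis))
    if dr.size == 0:
        return -1
    best = int(np.argmin(dr))
    return best if dr[best] < max_dr else -1

# Gen jet <-> gen tau uses dR < 0.4; gen jet <-> reco jet uses dR < 0.3 (see text).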
| File | # Jets | Size |
| --- | --- | --- |
| z_test.parquet | 870 843 | 171 MB |
| z_train.parquet | 3 483 369 | 681 MB |
| zh_test.parquet | 1 068 606 | 213 MB |
| zh_train.parquet | 4 274 423 | 851 MB |
| qq_test.parquet | 6 366 715 | 1.4 GB |
| qq_train.parquet | 25 466 858 | 5.6 GB |
The dataset consists of 6 files, totaling 8.9 GB.
The .parquet files can be directly loaded with the Awkward Array Python library.
An example of how one might use the dataset and its features is given in data_intro.ipynb.
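As a minimal sketch, one of the files can be loaded with Awkward Array like so (assuming z_test.parquet has been downloaded to the working directory):

import awkward as ak

jets = ak.from_parquet("z_test.parquet")

print(jets.fields)               # e.g. reco_jet_p4s, reco_cand_p4s, gen_jet_tau_decaymode, ...
print(len(jets))                 # number of entries in the file
print(jets["reco_cand_p4s"][0])  # PF-candidate four-momenta of the first jet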
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This training dataset was calculated using the mechanistic modeling approach. See the “Benchmark Synthetic Training Data for Artificial Intelligence-based Li-ion Diagnosis and Prognosis” publication for more details. More details will be added when published. The prognosis dataset was harder to define, as there are no limits on how the three degradation modes can evolve. For this proof-of-concept work, we considered eight parameters to scan. For each degradation mode, degradation was chosen to follow equation (1).
%degradation = a × cycle + (exp(b × cycle) − 1)    (1)
Considering the three degradation modes, this accounts for six parameters to scan. In addition, two other parameters were added: a delay for the exponential factor for LLI, and a parameter for the reversibility of lithium plating. The delay was introduced to reflect degradation paths where plating cannot be explained by an increase of LAMs or resistance [55]. The chosen parameters and their values are summarized in Table S1 and their evolution is represented in Figure S1. Figure S1(a,b) presents the evolution of parameters p1 to p7. At worst, the cells endured 100% of one of the degradation modes in around 1,500 cycles. Minimal LLI was chosen to be 20% after 3,000 cycles, to guarantee at least 20% capacity loss for all the simulations. For the LAMs, conditions were less restrictive, and, after 3,000 cycles, the lowest degradation is 3%. The reversibility factor p8 was calculated with equation (2) when LAM_NE > PT.
%LLI = %LLI + p8 × (LAM_NE − PT)    (2)
where PT was calculated with equation (3) from [60].
PT = 100 − ((100 − LAM_PE) / (100 × LR_ini − LAM_PE)) × (100 − OFS_ini − LLI)    (3)
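A small Python sketch of equations (1)-(3), where variable names such as lr_ini (for LR_ini) and ofs_ini (for OFS_ini) are chosen here purely for illustration:

import numpy as np

def degradation_pct(cycle, a, b):
    # Equation (1): %degradation = a*cycle + (exp(b*cycle) - 1)
    return a * cycle + (np.exp(b * cycle) - 1.0)

def plating_threshold(lam_pe, lr_ini, ofs_ini, lli):
    # Equation (3): plating threshold PT, from [60]
    return 100 - ((100 - lam_pe) / (100 * lr_ini - lam_pe)) * (100 - ofs_ini - lli)

def lli_with_plating(lli, lam_ne, p8, pt):
    # Equation (2): extra LLI from partially reversible plating, applied when LAM_NE > PT
    return lli + p8 * (lam_ne - pt) if lam_ne > pt else lli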
Varying all those parameters accounted for more than 130,000 individual duty cycles, with one voltage curve for every 100 cycles. Six MATLAB .mat files are included. The GIC-LFP_duty_other.mat file contains 12 variables:
Qnorm: normalized capacity scale for all voltage curves.
p1 to p8: values used to generate the duty cycles.
key: index of which values were used for each degradation path, 1 - p1, …, 8 - p8.
QL: capacity loss, one line per path, one column per 100 cycles.
File GIC-LFP_duty_LLI-LAMsvalues.mat contains the values of LLI, LAM_PE, and LAM_NE for all cycles (one line per 100 cycles) and duty cycles (columns).
Files GIC-LFP_duty_1 to GIC-LFP_duty_4 contain the voltage data split into 1 GB chunks (40,000 simulations). Each cell corresponds to one line in the key variable. Inside each cell, there is one column per 100 cycles.
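A minimal sketch of reading the files in Python (assuming the .mat files are not in the v7.3/HDF5 format, which would require h5py instead):

from scipy.io import loadmat

other = loadmat("GIC-LFP_duty_other.mat")
print([k for k in other if not k.startswith("__")])  # the 12 variables listed above

QL = other["QL"]        # capacity loss: one row per path, one column per 100 cycles
Qnorm = other["Qnorm"]  # normalized capacity scale for the voltage curves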
The Swift XRT (Burrows et al. 2005, SSRv, 120, 165) is a sensitive, broad-band (0.2 - 10 keV) X-ray imager with an effective area of about 125 cm**2 at 1.5 keV. The 600 x 600 pixel CCD at the focus provides a 23.6' x 23.6' field of view with a pixel scale of 2.36". The point spread function is 18" (HPD) at 1.5 keV. These XRT surveys represent the data from the first 12.5 years of Swift X-ray observations. They include all data taken in photon counting mode. A total of just over 8% of the sky has some non-zero exposure. The fraction of sky exposed as a function of the exposure is given in the following table:

Exposure (s): >0 | 10 | 30 | 100 | 300 | 1000 | 3000 | 10000 | 30000 | 100000 | 300000
Coverage (%): 8.42 | 8.37 | 8.29 | 7.67 | 7.29 | 5.68 | 3.40 | 1.26 | 0.35 | 0.044 | 0.00118

The individual exposure and counts maps have been combined into a Hierarchical Progressive Survey (HiPS), where the data are stored in tiles in the HEALPix projection at a number of different resolutions. The highest resolution pixels (HEALPix order 17) have a size of roughly 1.6". Data are also stored at lower resolutions at factors of 1/2, 1/4, 1/8, 1/16, and 1/32, and in an all-sky image with a resolution 1/256 of the highest resolution. An intensity map has been created as the ratio of the counts and exposure maps. These surveys combine the basic count and exposure maps provided as standard products in the Swift XRT archive in obsid/xrt/products/*xpc_(sk|ex).img.gz.

The surveys were created as follows: All of the exposure maps available in the archive in mid-May 2017 were combined using the CDS-developed Hipsgen tool. This includes 129,063 observations for which both count and exposure files were found in PC mode. Three exposures where there was a counts map but no exposure map were ignored. A few exposure files had more than one exposure extension: 1,082 files had two extensions and 1 file had 3 extensions. The 1,084 HDUs in extensions were extracted as separate files and included in the total exposure. The value of 0 was given to the Hipsgen software as the null value for the FITS files. This caused the CDS software to treat such pixels as missing rather than as 0 exposure.

The counts data were extracted from the counts maps for each observation using SkyView-developed software. For any pixel in which a count was recorded, the corresponding exposure file was checked, and if there was any exposure (in any of the associated extensions), the count was retained. If there was no exposure in any of the extensions of the corresponding exposure file, the counts in the pixel were omitted. Once a count was accepted, the overlap between the counts map pixel and the pixels of the corresponding HiPS tile (or tiles) was computed. Each count was then assigned entirely to a single pixel in the HiPS tile randomly, with the destination pixel probabilities weighted by the area of the overlap. Thus if several counts were found in a given counts map pixel, they might be assigned to different pixels in the output image. The HiPS pixels (~1.6") used were of substantially higher resolution than the XRT resolution of 18" and somewhat higher than the counts map resolution of 2.36". A total of 183,750,428 photons were extracted from the counts maps, while 15,226 were rejected as being from pixels with 0 exposure. There were 501 pixels which required special treatment as straddling the boundaries of the HEALPix projection.

The resulting counts tiles were then clipped using the exposure tiles that had been previously generated. Essentially, this transferred the coverage of the exposure tiles to the counts tiles: any counts pixel where the corresponding exposure pixel was a NaN was changed to a NaN to indicate that there was no coverage in this region. During the clipping process 137,730 HiPS level 8 tiles were clipped (of 786,432 over the entire sky). There were 12,236 tiles for which there was some exposure but no counts found. During the clipping process 2 photons were found on pixels where there was no corresponding exposure in the exposure tiles. This can happen when the pixel assignment process noted above shifts a photon just outside the exposed region, but should be -- as it was -- rare. These photons were deleted.

After creating the clipped level 8 counts maps, level 7 to 3 tiles and an all-sky map were generated by averaging pixels 2x2 to decrease each level. When adding the four pixels in the level N map together, only pixels whose value was not NaN were considered. Finally, an intensity map was created by dividing the counts tiles by the exposure tiles. To eliminate gross fluctuations due to rare counts in regions with very low exposure, only regions with exposure > 1 second were retained. A total of 30 photons were deleted due to this criterion.

Note that while any sampler may in principle be used with these data, the Spline sampler may give unexpected results: the spline computation propagates NaNs through the image, so even occasional NaNs can corrupt the output image completely, and NaNs are very common in this dataset. Also, if the region straddles a boundary in the HEALPix projection, the size of the requested input region is likely to exceed memory limits, since the HiPS data are treated as a single very large image.

Provenance: Data generated from public images at the HEASARC archive. This is a service of NASA HEASARC.
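For illustration, the area-weighted random assignment of counts to HiPS pixels described above might look like the following sketch (function and variable names are hypothetical, not from the actual SkyView software):

import numpy as np

rng = np.random.default_rng()

def assign_count(hips_pixel_ids, overlap_areas):
    # Assign one photon to a single HiPS pixel, with probabilities proportional
    # to the overlap area between the counts-map pixel and each HiPS pixel.
    areas = np.asarray(overlap_areas, dtype=float)
    return rng.choice(hips_pixel_ids, p=areas / areas.sum())

# Two counts from the same counts-map pixel may land in different HiPS pixels:
print(assign_count([101, 102, 103], [0.7, 0.2, 0.1]))
print(assign_count([101, 102, 103], [0.7, 0.2, 0.1]))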
Electron Drift Instrument (EDI) Electric Field Survey, Level 2, 5 s Data. EDI has two scientific data acquisition modes, called electric field mode and ambient mode. In electric field mode, two coded electron beams are emitted such that they return to the detectors after one or more gyrations in the ambient magnetic and electric field. The firing directions and times-of-flight allow the derivation of the drift velocity and electric field. In ambient mode, the electron beams are not used. The detectors, with their large geometric factors and their ability to adjust the field of view quickly, allow continuous sampling of ambient electrons at a selected pitch angle and fixed but selectable energy.

To find the beam directions that will hit the detector, EDI sweeps each beam in the plane perpendicular to B at a fixed angular rate of 0.22 °/ms until a signal has been acquired by the detector. Once a signal has been acquired, the beams are swept back and forth to stay on target. Beam detection is not determined from the changes in the count-rates directly, but from the square of the beam counts divided by the background counts from ambient electrons, i.e., from the square of the instantaneous signal-to-noise ratio (SNR). This quantity is computed from data provided by the correlator in the Gun-Detector Electronics that also generates the coding pattern imposed on the outgoing beams. If the squared SNR exceeds a threshold, this is taken as evidence that the beam is returning to the detector. The thresholds for SNR are chosen dependent on background fluxes. They represent a compromise between getting false hits (induced by strong variations in background electron fluxes) and missing true beam hits.

The basic software loop that controls EDI operations is executed every 2 ms. As the times when the beams hit their detectors are neither synchronized with the telemetry nor equidistant, EDI data have no fixed time-resolution. Data are reported in telemetry slots. In Survey, using the standard packing mode 0, there are eight telemetry slots per second and Gun-Detector Unit (GDU). The last beam detected during the previous slot will be reported in the current slot. If no beam has been detected, the data quality will be set to zero. In Burst telemetry there are 128 slots per second and GDU. The data in each slot consist of information regarding the beam firing directions (stored in the form of analytic gun deflection voltages), times-of-flight (if successfully measured), quality indicators, time stamps of the beam hits, and some auxiliary correlator-related information.

Whenever EDI is not in electron drift mode, it uses its ambient electron mode. The mode has the capability to sample at either 90 degrees pitch angle or at 0/180 degrees (field aligned), or to alternate between 90 degrees and field aligned with selectable dwell times. While all options have been demonstrated during the commissioning phase, only the field-aligned mode has been used in the routine operations phase. The choices for energy are 250 eV, 500 eV, and 1 keV. The two detectors, which are facing opposite hemispheres, look strictly into opposite directions, so while one detector is looking along B the other is looking antiparallel to B (corresponding to pitch angles of 180 and 0 degrees, respectively). The two detectors switch roles every half spin of the spacecraft as the tip of the magnetic field vector spins outside the field of view of one detector and into the field of view of the other detector.
This is the primary data product generated from data collected in electric field mode. The science data generated are drift velocity and electric field data in various coordinate systems. They are derived from triangulation and/or time-of-flight analysis. Where both methods are applicable, their results will be combined using a weighting approach based on their relative errors. The EDI instrument paper can be found at: http://link.springer.com/article/10.1007%2Fs11214-015-0182-7. The EDI instrument data products guide can be found at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset provides counts of tap ons and tap offs made on the Opal ticketing system during two non-consecutive weeks in 2016. The Opal tap on and tap off dataset contains six CSV files covering two weeks (14 days) of Opal data across the four public transport modes. Privacy is the utmost priority for all Transport for NSW Open Data, and there is no information that can identify any individual in the Open Opal Tap On and Tap Off data. This means that any data that is, or can be, linked to an individual's Opal card has been removed. This dataset is subject to specific terms and conditions. There are three CSV files per week, and these provide a privacy-protected count of taps against:
Time – binned to 15 minutes by tap (tap on or tap off), by date and by mode
Location – by tap (tap on or tap off), by date and by mode
Time with location – binned to 15 minutes, by tap (tap on or tap off), by date and by mode
The tap on and tap off counts are not linked and individual trips cannot be derived using the data. The two weeks of Opal data are:
Monday 21 November 2016 – Sunday 27 November 2016
Monday 26 December 2016 – Sunday 1 January 2017
Release 1 files are also linked below.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Authors:
*Corresponding author: mathias.sable-meyer@ucl.ac.uk
The perception and production of regular geometric shapes is a characteristic trait of human cultures since prehistory, whose neural mechanisms are unknown. Behavioral studies suggest that humans are attuned to discrete regularities such as symmetries and parallelism, and rely on their combinations to encode regular geometric shapes in a compressed form. To identify the relevant brain systems and their dynamics, we collected functional MRI and magnetoencephalography data in both adults and six-year-olds during the perception of simple shapes such as hexagons, triangles and quadrilaterals. The results revealed that geometric shapes, relative to other visual categories, induce a hypoactivation of ventral visual areas and an overactivation of the intraparietal and inferior temporal regions also involved in mathematical processing, whose activation is modulated by geometric regularity. While convolutional neural networks captured the early visual activity evoked by geometric shapes, they failed to account for subsequent dorsal parietal and prefrontal signals, which could only be captured by discrete geometric features or by more advanced transformer models of vision. We propose that the perception of abstract geometric regularities engages an additional symbolic mode of visual perception.
We separately share the MEG dataset at https://openneuro.org/datasets/ds006012. Below are some notes about the fMRI dataset of N=20 adult participants (sub-2xx, numbers between 204 and 223), and N=22 children (sub-3xx, numbers between 301 and 325).
Data were preprocessed with fMRIPrep 20.0.5, using the following command:
/usr/local/miniconda/bin/fmriprep /data /out participant --participant-label <label> --output-spaces MNI152NLin6Asym:res-2 MNI152NLin2009cAsym:res-2
Defacing was performed with bidsonym, running the pydeface masking and the nobrainer brain registration pipeline. sub-325 was acquired by a different experimenter and defaced before being shared with the rest of the research team, hence the slightly different defacing mask. That participant was also preprocessed separately, using a more recent fMRIPrep version: 20.2.6. sub-313 and sub-316 are missing one run of the localizer each. sub-316 has no data at all for the geometry task. sub-308 has no useable data for the intruder task.
Since all of these still have some data to contribute to either task, all available files were kept on this dataset. The analysis code reflects these inconsistencies where required with specific exceptions.
Description: The dataset is intentionally provided for data cleansing and applying EDA techniques. It offers fun exploring and wrangling for data geeks. The data is very original, so dive in and happy exploring.
Features: In total, the dataset contains 121 features. Details are given below.
SK_ID_CURR: ID of loan in our sample
TARGET: Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample; 0 - all other cases)
NAME_CONTRACT_TYPE: Identification if loan is cash or revolving
CODE_GENDER: Gender of the client
FLAG_OWN_CAR: Flag if the client owns a car
FLAG_OWN_REALTY: Flag if client owns a house or flat
CNT_CHILDREN: Number of children the client has
AMT_INCOME_TOTAL: Income of the client
AMT_CREDIT: Credit amount of the loan
AMT_ANNUITY: Loan annuity
AMT_GOODS_PRICE: For consumer loans it is the price of the goods for which the loan is given
NAME_TYPE_SUITE: Who was accompanying the client when applying for the loan
NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave, …)
NAME_EDUCATION_TYPE: Level of highest education the client achieved
NAME_FAMILY_STATUS: Family status of the client
NAME_HOUSING_TYPE: What is the housing situation of the client (renting, living with parents, ...)
REGION_POPULATION_RELATIVE: Normalized population of region where client lives (higher number means the client lives in a more populated region)
DAYS_BIRTH: Client's age in days at the time of application
DAYS_EMPLOYED: How many days before the application the person started current employment
DAYS_REGISTRATION: How many days before the application did the client change his registration
DAYS_ID_PUBLISH: How many days before the application did the client change the identity document with which he applied for the loan
OWN_CAR_AGE: Age of client's car
FLAG_MOBIL: Did client provide mobile phone (1=YES, 0=NO)
FLAG_EMP_PHONE: Did client provide work phone (1=YES, 0=NO)
FLAG_WORK_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_CONT_MOBILE: Was mobile phone reachable (1=YES, 0=NO)
FLAG_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_EMAIL: Did client provide email (1=YES, 0=NO)
OCCUPATION_TYPE: What kind of occupation does the client have
CNT_FAM_MEMBERS: How many family members does the client have
REGION_RATING_CLIENT: Our rating of the region where client lives (1,2,3)
REGION_RATING_CLIENT_W_CITY: Our rating of the region where client lives, taking the city into account (1,2,3)
WEEKDAY_APPR_PROCESS_START: On which day of the week did the client apply for the loan
HOUR_APPR_PROCESS_START: Approximately at what hour did the client apply for the loan
REG_REGION_NOT_LIVE_REGION: Flag if client's permanent address does not match contact address (1=different, 0=same, at region level)
REG_REGION_NOT_WORK_REGION: Flag if client's permanent address does not match work address (1=different, 0=same, at region level)
LIVE_REGION_NOT_WORK_REGION: Flag if client's contact address does not match work address (1=different, 0=same, at region level)
REG_CITY_NOT_LIVE_CITY: Flag if client's permanent address does not match contact address (1=different, 0=same, at city level)
REG_CITY_NOT_WORK_CITY: Flag if client's permanent address does not match work address (1=different, 0=same, at city level)
LIVE_CITY_NOT_WORK_CITY: Flag if client's contact address does not match work address (1=different, 0=same, at city level)
ORGANIZATION_TYPE: Type of organization where client works
EXT_SOURCE_1: Normalized score from external data source
EXT_SOURCE_2: Normalized score from external data source
EXT_SOURCE_3: Normalized score from external data source
APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG, …: Normalized information about the building where the client lives: the average (_AVG suffix), modus (_MODE suffix), or median (_MEDI suffix) of apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors, …
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Summary:
Estimated stand-off distance between ADS-B equipped aircraft and obstacles. Obstacle information was sourced from the FAA Digital Obstacle File and the FHWA National Bridge Inventory. Aircraft tracks were sourced from processed data curated from the OpenSky Network. Results are presented as histograms organized by aircraft type and distance away from runways.
Description:
For many aviation safety studies, aircraft behavior is represented using encounter models, which are statistical models of how aircraft behave during close encounters. They are used to provide a realistic representation of the range of encounter flight dynamics where an aircraft collision avoidance system would be likely to alert. These models currently are, and historically have been, limited to interactions between aircraft; they have not represented the specific interactions between obstacles and aircraft equipped with transponders. In response, we calculated the standoff distance between obstacles and ADS-B equipped manned aircraft.
For robustness, MIT LL calculated the standoff distance using two different datasets of manned aircraft tracks and two datasets of obstacles. This approach aligned with the foundational research used to support the ASTM F3442/F3442M-20 well clear criteria of 2000 feet laterally and 250 feet AGL vertically.
The two datasets of aircraft tracks were processed tracks of ADS-B equipped aircraft curated from the OpenSky Network. It is likely that rotorcraft were underrepresented in these datasets. There were also no considerations for aircraft equipped only with Mode C or not equipped with any transponder. The first dataset, referred to as the “Monday” dataset, was used to train the v1.3 uncorrelated encounter models. The second dataset, referred to as the “aerodrome” dataset, was used to train the v2.0 and v3.x terminal encounter models. The Monday dataset consisted of 104 Mondays across North America. The aerodrome dataset was based on observations within 8 nautical miles of Class B, C, and D aerodromes in the United States for the first 14 days of each month from January 2019 through February 2020. Prior to any processing, the datasets required 714 and 847 gigabytes of storage. For more details on these datasets, please refer to "Correlated Bayesian Model of Aircraft Encounters in the Terminal Area Given a Straight Takeoff or Landing" and “Benchmarking the Processing of Aircraft Tracks with Triples Mode and Self-Scheduling.”
Two different datasets of obstacles were also considered. The first was point obstacles defined by the FAA digital obstacle file (DOF), consisting of point obstacle structures of antenna, lighthouse, meteorological tower (met), monument, sign, silo, spire (steeple), stack (chimney; industrial smokestack), transmission line tower (t-l tower), tank (water; fuel), tramway, utility pole (telephone pole, or pole of similar height, supporting wires), windmill (wind turbine), and windsock. Each obstacle was represented by a cylinder with the height reported by the DOF and a radius based on the reported horizontal accuracy. We did not consider the actual width and height of the structure itself. Additionally, we only considered obstacles at least 50 feet tall and marked as verified in the DOF.
The other obstacle dataset, termed “bridges,” was based on the bridges identified in the FAA DOF and additional information provided by the National Bridge Inventory (NBI). Due to the potential size and extent of bridges, it would not be appropriate to model them as point obstacles; however, the FAA DOF only provides a point location and no information about the size of a bridge. In response, we correlated the FAA DOF with the National Bridge Inventory, which provides information about the length of many bridges. Instead of sizing the simulated bridge based on horizontal accuracy, as with the point obstacles, the bridges were represented as circles with a radius based on the longest nearby bridge from the NBI. A circle representation was required because neither the FAA DOF nor the NBI provided sufficient information about orientation to represent bridges as rectangular cuboids. Similar to the point obstacles, the height of the obstacle was based on the height reported by the FAA DOF. Accordingly, the analysis using the bridge dataset should be viewed as risk-averse and conservative: it is possible that a manned aircraft was hundreds of feet away from an obstacle in actuality while the estimated standoff distance is significantly less. Additionally, since all obstacles are represented with a fixed height, the potentially flat and low-level entrances of a bridge are assumed to have the same height as the tall bridge towers. The attached figure illustrates an example simulated bridge.
It would have been extremely computationally inefficient to calculate the standoff distance for all possible track points. Instead, we defined an encounter between an aircraft and an obstacle as an aircraft flying at 3069 feet AGL or less coming within 3000 feet laterally of any obstacle in a 60 second time interval. If the criteria were satisfied, then for that 60 second track segment we calculated the standoff distance to all nearby obstacles. Vertical separation was based on the MSL altitude of the track and the maximum MSL height of an obstacle.
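In sketch form, the encounter screen described above reduces to a simple filter (an illustration under the stated criteria, not the actual MIT LL implementation):

import numpy as np

LATERAL_FT = 3000.0  # lateral threshold, feet
AGL_FT = 3069.0      # altitude threshold, feet AGL

def is_encounter(track_agl_ft, lateral_dist_ft):
    # True if, within a 60 s track segment, the aircraft is at or below
    # 3069 ft AGL while within 3000 ft laterally of an obstacle.
    agl = np.asarray(track_agl_ft)
    lat = np.asarray(lateral_dist_ft)
    return bool(np.any((agl <= AGL_FT) & (lat <= LATERAL_FT)))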
For each combination of aircraft track and obstacle datasets, the results were organized seven different ways. Filtering criteria were based on aircraft type and distance away from runways. Runway data was sourced from the FAA runways of the United States, Puerto Rico, and Virgin Islands open dataset. Aircraft type was identified as part of the em-processing-opensky workflow.
License
This dataset is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
This license requires that reusers give credit to the creator. It allows reusers to copy and distribute the material in any medium or format, in unadapted form and for noncommercial purposes only. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Exceptions are given for the not-for-profit standards organizations ASTM International and RTCA.
MIT is releasing this dataset in good faith to promote open and transparent research of the low altitude airspace. Given the limitations of the dataset and the need for more research, a more restrictive license was warranted. Namely, it is based only on observations of ADS-B equipped aircraft, which not all aircraft in the airspace are required to employ, and the observations were sourced from a crowdsourced network whose surveillance coverage has not been robustly characterized.
As more research is conducted and the low altitude airspace is further characterized or regulated, it is expected that a future version of this dataset may have a more permissive license.
Distribution Statement
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
© 2021 Massachusetts Institute of Technology.
Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.
This material is based upon work supported by the Federal Aviation Administration under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Federal Aviation Administration.
This document is derived from work done for the FAA (and possibly others); it is not the direct product of work done for the FAA. The information provided herein may include content supplied by third parties. Although the data and information contained herein has been produced or processed from sources believed to be reliable, the Federal Aviation Administration makes no warranty, expressed or implied, regarding the accuracy, adequacy, completeness, legality, reliability or usefulness of any information, conclusions or recommendations provided herein. Distribution of the information contained herein does not constitute an endorsement or warranty of the data or information provided herein by the Federal Aviation Administration or the U.S. Department of Transportation. Neither the Federal Aviation Administration nor the U.S. Department of
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wind Spacecraft:
The Wind spacecraft (https://wind.nasa.gov) was launched on November 1, 1994 and currently orbits the first Lagrange point between the Earth and the Sun. A comprehensive review can be found in Wilson et al. [2021]. It holds a suite of instruments, from gamma-ray detectors to quasi-static magnetic field instruments (Bo). The instruments used for this data product are the fluxgate magnetometer (MFI) [Lepping et al., 1995] and the radio receivers (WAVES) [Bougeret et al., 1995]. The MFI measures the 3-vector Bo at ~11 samples per second (sps); WAVES observes electromagnetic radiation from ~4 kHz to >12 MHz, which provides an observation of the upper hybrid line (also called the plasma line) used to define the total electron density, and also takes time-series snapshot/waveform captures of electric and magnetic field fluctuations, called TDS bursts herein.
WAVES Instrument:
The WAVES experiment [Bougeret et al., 1995] on the Wind spacecraft is composed of three orthogonal electric field antennas and three orthogonal search coil magnetometers. The electric fields are measured through five different receivers: the Low Frequency FFT receiver called FFT (0.3 Hz to 11 kHz), the Thermal Noise Receiver called TNR (4-256 kHz), Radio receiver band 1 called RAD1 (20-1040 kHz), Radio receiver band 2 called RAD2 (1.075-13.825 MHz), and the Time Domain Sampler (TDS). The electric field antennas are dipole antennas, with two orthogonal antennas in the spin plane and one spin-axis stacer antenna.
The TDS receiver allows one to examine the electromagnetic waves observed by Wind as time-series waveform captures. There are two modes of operation, TDS Fast (TDSF) and TDS Slow (TDSS). TDSF returns 2048 data points for two channels of the electric field, typically Ex and Ey (i.e., spin plane components), with little to no gain below ~120 Hz (the data herein have been high-pass filtered above ~150 Hz for this reason). TDSS returns four channels, with three electric (magnetic) field components and one magnetic (electric) component. The search coils show a gain roll-off near ~3.3 Hz [e.g., see Wilson et al., 2010; Wilson et al., 2012; Wilson et al., 2013 and references therein for more details].
The original calibration of the electric field antenna found that the effective antenna lengths are roughly 41.1 m, 3.79 m, and 2.17 m for the X, Y, and Z antenna, respectively. The +Ex antenna was broken twice during the mission as of June 26, 2020. The first break occurred on August 3, 2000 around ~21:00 UTC and the second on September 24, 2002 around ~23:00 UTC. These breaks reduced the effective antenna length of Ex from ~41 m to 27 m after the first break and ~25 m after the second break [e.g., see Malaspina et al., 2014; Malaspina & Wilson, 2016].
TDS Bursts:
TDS bursts are waveform captures/snapshots of electric and magnetic field data. The data are triggered by the largest amplitude waves which exceed a specific threshold and are then stored in a memory buffer. The bursts are ranked according to a quality filter which mostly depends upon amplitude. Due to the age of the spacecraft and the ubiquity of large amplitude electromagnetic and electrostatic waves, the memory buffer often fills up before dumping onto the magnetic tape drive. If the memory buffer is full, then the bottom-ranked TDS burst is erased every time a new TDS burst is sampled: the newest TDS burst sampled by the instrument is always stored, and if it ranks higher than any other in the list, it will be kept. Earlier in the mission, there were also so-called honesty bursts, which were taken periodically to test whether the triggers were working properly. It was found that the TDSF triggered properly, but not the TDSS, so the TDSS was set to trigger off of the Ex signals.
A TDS burst from the Wind/WAVES instrument is always 2048 time steps for each channel. The sample rate for TDSF bursts ranges from 1875 samples/second (sps) to 120,000 sps. Every TDS burst is marked with a unique set of numbers (unique on any given date) to help distinguish it from others and to ensure any set of channels is appropriately connected. For instance, during one spacecraft downlink interval there may be 95% of the TDS bursts with a complete set of channels (i.e., TDSF has two channels, TDSS has four) while the remaining 5% can be missing channels (just example numbers, not quantitatively accurate). During another downlink interval, those missing channels may be returned if they have not been overwritten. During every downlink, the flight operations team at NASA Goddard Space Flight Center (GSFC) generates level zero binary files from the raw telemetry data. Those files are filled with data received on that date, and the file name is labeled with that date. There is no attempt to sort the data within chronologically, so any given level zero file can contain data from multiple dates. Thus, it is often necessary to load upwards of five days of level zero files to find as many full channel sets as possible. The remaining unmatched channel sets comprise a much smaller fraction of the total.
All data provided here are from TDSF, so only two channels. Most of the time channel 1 will be associated with the Ex antenna and channel 2 with the Ey antenna. The data are provided in the spinning instrument coordinate basis with associated angles necessary to rotate into a physically meaningful basis (e.g., GSE).
TDS Time Stamps:
Each TDS burst is tagged with a time stamp called a spacecraft event time or SCET. The TDS datation time is sampled after the burst is acquired, which requires a delay buffer. The datation time requires two corrections. The first correction arises from tagging the TDS datation with an associated spacecraft major frame in housekeeping (HK) data. The second correction removes the delay buffer duration. Both inaccuracies are essentially artifacts of on-ground derived values in the archives created by the WINDlib software (K. Goetz, Personal Communication, 2008) found at https://github.com/lynnbwilsoniii/Wind_Decom_Code.
The WAVES instrument's HK mode sends relevant low rate science back to ground once every spacecraft major frame. If multiple TDS bursts occur in the same major frame, it is possible for the WINDlib software to assign them the same SCETs. The reason is that this top-level SCET is only accurate to within +300 ms (in 120,000 sps mode) due to the issues described above (at lower sample rates, the error can be slightly larger). The time stamp uncertainty is a positive definite value because it results from digitization rounding errors. One can correct these issues to within +10 ms using the proper HK data.
*** The SCETs of the data stored here have not been corrected! ***
The 300 ms uncertainty, due to the HK corrections mentioned above, results from WINDlib trying to recreate the time stamp after it has been telemetered back to ground. If a burst stays in the TDS buffer for extended periods of time (i.e., >2 days), the interpolation done by WINDlib can make mistakes in the 11th significant digit. The positive definite nature of this uncertainty is due to rounding errors associated with the onboard DPU (digital processing unit) clock rollover. The DPU clock is a 24-bit integer clock sampling at ∼50,018.8 Hz. The clock rolls over at ∼5366.691244092221 seconds, i.e., (16 × 2^24)/50,018.8. The sample rate is temperature sensitive and thus subject to change over time. From a sample of 384 different points on 14 different days, a statistical estimate of the rollover time is 5366.691124061162 ± 0.000478370049 seconds (calculated by Lynn B. Wilson III, 2008). Note that the WAVES instrument team used UR8 times, which are the number of 86,400 second days from 1982-01-01/00:00:00.000 UTC.
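The quoted rollover period can be checked directly from the figures above:

ticks = 16 * 2**24        # clock ticks per rollover of the 24-bit DPU clock
rate_hz = 50_018.8        # nominal DPU clock sample rate
print(ticks / rate_hz)    # -> 5366.6912... seconds, matching the value above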
The method to correct the SCETs to within +10 ms, were one to do so, is given as follows:
1. Retrieve the DPU clock times, SCETs, UR8 times, and DPU Major Frame Numbers from the WINDlib libraries on the VAX/ALPHA systems for the TDSS(F) data of interest.
2. Retrieve the same quantities from the HK data.
3. Match the HK event number with the same DPU Major Frame Number as the TDSS(F) burst of interest.
4. Find the difference in DPU clock times between the TDSS(F) burst of interest and the HK event with the matching major frame number (note: the TDSS(F) DPU clock time will always be greater than the HK DPU clock time if they share the same DPU Major Frame Number and the DPU clock has not rolled over).
5. Convert the difference to a UR8 time and add it to the HK UR8 time. The new UR8 time is the corrected UR8 time, accurate to within +10 ms.
6. Find the difference between the new UR8 time and the UR8 time WINDlib associates with the TDSS(F) burst. Add the difference to the DPU clock time assigned by WINDlib to get the corrected DPU clock time (note: watch for the DPU clock rollover).
7. Convert the new UR8 time to a SCET using either the IDL WINDlib libraries or the TMLib (STEREO S/WAVES software) libraries of available functions. This new SCET is accurate to within +10 ms.
One can find a UR8 to UTC conversion routine at https://github.com/lynnbwilsoniii/wind_3dp_pros in the ~/LYNN_PRO/Wind_WAVES_routines/ folder.
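For orientation, the UR8 definition above translates to a few lines of Python (a sketch; the official conversions are in the linked WINDlib/TMLib routines):

from datetime import datetime, timedelta, timezone

UR8_EPOCH = datetime(1982, 1, 1, tzinfo=timezone.utc)

def ur8_to_utc(ur8: float) -> datetime:
    # UR8 times are 86,400-second days counted from 1982-01-01/00:00:00.000 UTC.
    return UR8_EPOCH + timedelta(days=ur8)

print(ur8_to_utc(0.5))  # -> 1982-01-01 12:00:00+00:00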
Examples of good waveforms can be found in the notes PDF at https://wind.nasa.gov/docs/wind_waves.pdf.
Data Set Description
Each Zip file contains 300+ IDL save files, one for each day of the year with available data. This data set is not complete, as the software used to retrieve and calibrate these TDS bursts did not have sufficient error handling for some of the more nuanced bit errors or major frame errors in some of the level zero files. There is currently (as of June 27, 2020) an effort (by Keith Goetz et al.) to generate the entire TDSF and TDSS data set in one repository to be put on SPDF/CDAWeb as CDF files. Once that data set is available, it will supersede this one.
https://creativecommons.org/publicdomain/zero/1.0/
Sales records for the years 2011-2014, with 3 product categories and 17 sub-categories over different segments. The objective is to expand the business in profitable regions based on the growth percentage and profits.
Order ID: A unique ID given to each order placed.
Order Date: The date on which the order was placed.
Customer Name: Name of the customer placing the order.
Country: The country to which the customer belongs.
State: The state, within the country, to which the customer belongs.
City: The city in which the customer resides.
Region: Contains the region details.
Segment: The segment to which the ordered product belongs.
Ship Mode: The mode of shipping of the order to the customer location.
Category: The category to which the product belongs.
Sub-Category: The sub-category to which the product belongs.
Product Name: The name of the product ordered by the customer.
Discount: The discount applicable on a product.
Sales: The actual sales for a particular order.
Profit: Profit earned on an order.
Quantity: The total quantity of the product ordered in a single order.
Feedback: The feedback given by the customer on the complete shopping experience. If feedback was provided, TRUE; if not, FALSE.
This dataset can be helpful for analyzing the data to develop marketing strategies and to measure parameters like customer retention rate, churn rate, etc.
This data set represents an extension of earlier computations of the internal M2-tide generation (for mode 1: Pollmann and Nycander, 2023; data set: 10.17882/92304) to the modes 2-10. The methodology is based on linear theory and is explained in Pollmann et al., 2019. In addition, we address the role of slope criticality in the open ocean and identify which conversion rate estimates fall into (i) the linear regime, where linear theory is valid; (ii) the weakly nonlinear regime, where linear theory might underestimate, but does not strongly overestimate, the nonlinear conversion; and (iii) the strongly nonlinear regime, where linear theory might substantially overestimate the nonlinear conversion. Details on the procedure can be found in the associated manuscript (Geoffroy, Pollmann and Nycander, currently under revision for J. Phys. Oceanogr.). Please cite this as well as the papers mentioned above when using this data set. This data set includes the global conversion rate estimates of modes 1-10 below 400 m, 700 m, and 1000 m depth. We include the uncorrected estimates as well as those in the linear regime (i) and in the regime combining linear and weakly nonlinear conditions (i+ii). The associated masks, also differentiating between ridges and canyons, are included as well. Note that there is no weakly nonlinear regime (ii) for canyons, and no strongly nonlinear regime (iii) for ridges for the first mode. Conversion rate estimates from topography patches with canyons deeper than the patch mean depth are also masked. For reasons of completeness, we add the mode-1 estimates and the mask for land and continental slopes of Pollmann & Nycander, 2023 (data set: 10.17882/92304).
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
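The sampling itself was done with the distributed R script; purely as an illustration of the described two-stage design, a Python sketch (the column names are assumptions):

import pandas as pd

HH_PER_EA = 25   # fixed number of households selected per enumeration area
N_HH = 8000      # total sample size, i.e. 8000 / 25 = 320 enumeration areas

def draw_sample(hh: pd.DataFrame) -> pd.DataFrame:
    # hh: one row per household, with columns 'stratum' (geo_1 x urban/rural)
    # and 'ea_id'. Assumes every enumeration area holds at least 25 households.
    n_eas = N_HH // HH_PER_EA
    # Stage 1: allocate EAs to strata proportionally to stratum size, then
    # draw that many EAs at random within each stratum.
    alloc = (hh.groupby("stratum").size() / len(hh) * n_eas).round().astype(int)
    chosen = []
    for stratum, k in alloc.items():
        eas = hh.loc[hh["stratum"] == stratum, "ea_id"].drop_duplicates()
        chosen.extend(eas.sample(k))
    # Stage 2: draw 25 households at random within each selected EA.
    return (hh[hh["ea_id"].isin(chosen)]
            .groupby("ea_id", group_keys=False)
            .apply(lambda g: g.sample(HH_PER_EA)))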
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was, however, created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr
raw = zarr.open("<sample>.zarr", mode='r', path="volumes/raw")
seg = zarr.open("<sample>.zarr", mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also be explicitly converted to numpy arrays.
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
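As a trivial sketch of how the aggregate score S combines avF1 and C (see the paper for the formal definitions of the individual metrics):

def ranking_score(av_f1: float, coverage: float) -> float:
    # S: the average of avF1 and C, per the metric list above.
    return 0.5 * (av_f1 + coverage)

print(ranking_score(0.42, 0.58))  # -> 0.5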
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results, please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title         = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author        = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year          = 2024,
  eprint        = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Electron Drift Instrument (EDI) Electric Field Burst Survey, Level 2, 0.0009765625 s Data (1024 samples/s). EDI has two scientific data acquisition modes, called electric field mode and ambient mode. In electric field mode, two coded electron beams are emitted such that they return to the detectors after one or more gyrations in the ambient magnetic and electric field. The firing directions and times-of-flight allow the derivation of the drift velocity and electric field. In ambient mode, the electron beams are not used. The detectors, with their large geometric factors and their ability to adjust the field of view quickly, allow continuous sampling of ambient electrons at a selected pitch angle and fixed but selectable energy.
To find the beam directions that will hit the detector, EDI sweeps each beam in the plane perpendicular to B at a fixed angular rate of 0.22 °/ms until a signal has been acquired by the detector. Once a signal has been acquired, the beams are swept back and forth to stay on target. Beam detection is not determined from the changes in the count rates directly, but from the square of the beam counts divided by the background counts from ambient electrons, i.e., from the square of the instantaneous signal-to-noise ratio (SNR). This quantity is computed from data provided by the correlator in the Gun-Detector Electronics, which also generates the coding pattern imposed on the outgoing beams. If the squared SNR exceeds a threshold, this is taken as evidence that the beam is returning to the detector. The SNR thresholds are chosen depending on background fluxes. They represent a compromise between getting false hits (induced by strong variations in background electron fluxes) and missing true beam hits.
The basic software loop that controls EDI operations is executed every 2 ms. As the times when the beams hit their detectors are neither synchronized with the telemetry nor equidistant, EDI data have no fixed time resolution. Data are reported in telemetry slots. In Survey, using the standard packing mode 0, there are eight telemetry slots per second and Gun Detector Unit (GDU). The last beam detected during the previous slot will be reported in the current slot. If no beam has been detected, the data quality will be set to zero. In Burst telemetry there are 128 slots per second and GDU. The data in each slot consist of information regarding the beam firing directions (stored in the form of analytic gun deflection voltages), times-of-flight (if successfully measured), quality indicators, time stamps of the beam hits, and some auxiliary correlator-related information.
Whenever EDI is not in electron drift mode, it uses its ambient electron mode. This mode can sample at either 90 degrees pitch angle or at 0/180 degrees (field aligned), or alternate between 90 degrees and field aligned with selectable dwell times. While all options have been demonstrated during the commissioning phase, only the field-aligned mode has been used in the routine operations phase. The choices for energy are 250 eV, 500 eV, and 1 keV. The two detectors, which face opposite hemispheres, look strictly into opposite directions, so while one detector is looking along B the other is looking antiparallel to B (corresponding to pitch angles of 180 and 0 degrees, respectively).
The two detectors switch roles every half spin of the spacecraft as the tip of the magnetic field vector spins outside the field of view of one detector and into the field of view of the other detector.
This is the primary data product generated from data collected in electric field mode. The science data generated are drift velocity and electric field data in various coordinate systems. They are derived from triangulation and/or time-of-flight analysis. Where both methods are applicable, their results will be combined using a weighting approach based on their relative errors. The EDI instrument paper can be found at: http://link.springer.com/article/10.1007%2Fs11214-015-0182-7. The EDI instrument data products guide can be found at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.
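As a concrete illustration of the beam-detection criterion described above, here is a minimal sketch with hypothetical counts and a hypothetical threshold; the flight software computes this quantity from correlator data and uses its own background-dependent thresholds:

def beam_detected(beam_counts: float, background_counts: float,
                  snr2_threshold: float) -> bool:
    # detection statistic: beam counts squared over ambient background counts,
    # i.e. the square of the instantaneous signal-to-noise ratio
    if background_counts <= 0:
        return False
    return beam_counts ** 2 / background_counts > snr2_threshold

# hypothetical numbers: 400 beam counts on a background of 1000 counts
print(beam_detected(400.0, 1000.0, snr2_threshold=100.0))  # True (160 > 100)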
The GHS is an annual household survey specifically designed to measure the living circumstances of South African households. The GHS collects data on education, employment, health, housing and household access to services.
The survey is representative at national level and at provincial level.
Households and individuals
The survey covered all de jure household members (usual residents) of households in the nine provinces of South Africa and residents in workers' hostels. The survey does not cover collective living quarters such as students' hostels, old age homes, hospitals, prisons and military barracks.
Sample survey data
A multi-stage, stratified random sample was drawn using probability-proportional-to-size principles. First-level stratification was based on province and second-level stratification on district council. The GHS 2009 represents the second year of a new master sample (the first year was GHS 2008) that will be used until 2010.
Face-to-face [f2f]
The GHS uses questionnaires as data collection instruments.
The questionnaire for the General Household Survey has undergone various changes since 2002. Significant changes were made to the GHS 2009 questionnaire and this should be borne in mind when comparing across different datasets. See GHS 2009 statistical release for a detailed report on important differences between the questionnaires.
In GHS 2009-2010:
The variable on care provision (Q129acre) in GHS 2009 and GHS 2010 should be used with caution. The question used to collect the data (question 1.29a) asks:
"Does anyone in this household personally provide care for at least two hours per day to someone in the household who - owing to frailty, old age, disability, or ill-health cannot manage without help?"
Response codes (in the questionnaire, metadata, and dataset) are:
1 = No
2 = Yes, 2-19 hours per week
3 = Yes, 20-49 hours per week
4 = Yes, 50+ hours per week
5 = Do not know
There is an inconsistency between the question, which asks about hours per day, and the response options, which record hours per week. The outcome is that a respondent who provides care for one hour per day (7 hours per week) would presumably not answer this question. Someone providing care for 13 hours a week would also be excluded, as though they do not do serious caregiving, which is incorrect.
In GHS 2009-2015:
The variable on land size in the General Household Survey questionnaire for 2009-2015 should be used with caution. The data comes from questions on the households' agricultural activities in Section 8 of the GHS questionnaire: Household Livelihoods: Agricultural Activities. Question 8.8b asks:
“Approximately how big is the land that the household use for production? Estimate total area if more than one piece.” One of the response categories is worded as:
1 = Less than 500m2 (approximately one soccer field)
However, a soccer field is approximately 5000 m2, not 500 m2, so response category 1 is incorrect; the correct category should read 5000 m2. This response option is correct in GHS 2002-2008 and was flagged and corrected by Statistics SA in the GHS 2016.
Updates are delayed due to technical difficulties. How many people are staying at home? How far are people traveling when they don’t stay home? Which states and counties have more people taking trips? The Bureau of Transportation Statistics (BTS) now provides answers to those questions through our new mobility statistics. The Trips by Distance data and the numbers of people staying home and not staying home are estimated for BTS by the Maryland Transportation Institute and the Center for Advanced Transportation Technology Laboratory at the University of Maryland.
The travel statistics are produced from an anonymized national panel of mobile device data from multiple sources. All data sources used in the creation of the metrics contain no personal information. Data analysis is conducted at the aggregate national, state, and county levels. A weighting procedure expands the sample of millions of mobile devices, so the results are representative of the entire population in a nation, state, or county. To assure confidentiality and support data quality, no data are reported for a county if it has fewer than 50 devices in the sample on any given day.
Trips are defined as movements that include a stay of longer than 10 minutes at an anonymized location away from home. Home locations are imputed on a weekly basis. A movement with multiple stays of longer than 10 minutes before returning home is counted as multiple trips. Trips capture travel by all modes of transportation, including driving, rail, transit, and air.
The daily travel estimates are from a mobile device data panel merged from multiple data sources, which addresses the geographic and temporal sample variation issues often observed in a single data source. The merged data panel only includes mobile devices whose anonymized location data meet a set of data quality standards, which further ensures the overall data quality and consistency. The data quality standards consider both the temporal frequency and spatial accuracy of anonymized location point observations, temporal coverage and representativeness at the device level, spatial representativeness at the sample and county levels, etc. A multi-level weighting method that employs both device- and trip-level weights expands the sample to the underlying population at the county and state levels, before travel statistics are computed.
These data are experimental and may not meet all of our quality standards. Experimental data products are created using new data sources or methodologies that benefit data users in the absence of other relevant products. We are seeking feedback from data users and stakeholders on the quality and usefulness of these new products. Experimental data products that meet our quality standards and demonstrate sufficient user demand may enter regular production if resources permit.
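To make the trip definition concrete, here is a minimal sketch with hypothetical stay records; the actual BTS/University of Maryland pipeline operates on anonymized device traces and is far more involved:

from dataclasses import dataclass

@dataclass
class Stay:
    location: str    # anonymized location identifier
    minutes: float   # dwell time at that location

def count_trips(stays_between_home_visits: list[Stay]) -> int:
    # each stay of longer than 10 minutes away from home counts as one trip,
    # so a single home-to-home tour can contribute multiple trips
    return sum(1 for s in stays_between_home_visits
               if s.location != "home" and s.minutes > 10)

# hypothetical tour: grocery (25 min), coffee (5 min), work (480 min), then home
tour = [Stay("grocery", 25), Stay("coffee", 5), Stay("work", 480)]
print(count_trips(tour))  # 2 -- the 5-minute stop is not a qualifying stay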
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was generated for the purpose of developing unfolding methods that leverage generative machine learning models. It consists of two pieces: one piece contains events with the Standard Model (SM) production of a top-quark pair in the semi-leptonic decay mode, and the other contains events with top-quark pair production modified by a non-zero EFT operator. The SM dataset contains 15,015,000 events, and the EFT dataset contains 30,000,000. Both datasets store the following event configurations:
Each of these configurations is stored in a dedicated group as described below. Throughout, the units of energy and transverse momentum are GeV. For more details on the generation of this dataset, see Ref. [1].
Parton level data:
Particle level data:
Detector level data:
Citations:
[1] - https://arxiv.org/abs/2404.14332
[2] - https://twiki.cern.ch/twiki/bin/view/LHCPhysics/ParticleLevelTopDefinitions
Electron Drift Instrument (EDI) Q0 Survey, Level 2, 0.125 s Data (8 samples/s). EDI has two scientific data acquisition modes, called electric field mode and ambient mode. In electric field mode, two coded electron beams are emitted such that they return to the detectors after one or more gyrations in the ambient magnetic and electric field. The firing directions and times-of-flight allow the derivation of the drift velocity and electric field. In ambient mode, the electron beams are not used. The detectors, with their large geometric factors and their ability to adjust the field of view quickly, allow continuous sampling of ambient electrons at a selected pitch angle and fixed but selectable energy.
To find the beam directions that will hit the detector, EDI sweeps each beam in the plane perpendicular to B at a fixed angular rate of 0.22 °/ms until a signal has been acquired by the detector. Once a signal has been acquired, the beams are swept back and forth to stay on target. Beam detection is not determined from the changes in the count rates directly, but from the square of the beam counts divided by the background counts from ambient electrons, i.e., from the square of the instantaneous signal-to-noise ratio (SNR). This quantity is computed from data provided by the correlator in the Gun-Detector Electronics, which also generates the coding pattern imposed on the outgoing beams. If the squared SNR exceeds a threshold, this is taken as evidence that the beam is returning to the detector. The SNR thresholds are chosen depending on background fluxes. They represent a compromise between getting false hits (induced by strong variations in background electron fluxes) and missing true beam hits.
The basic software loop that controls EDI operations is executed every 2 ms. As the times when the beams hit their detectors are neither synchronized with the telemetry nor equidistant, EDI data have no fixed time resolution. Data are reported in telemetry slots. In Survey, using the standard packing mode 0, there are eight telemetry slots per second and Gun Detector Unit (GDU). The last beam detected during the previous slot will be reported in the current slot. If no beam has been detected, the data quality will be set to zero. In Burst telemetry there are 128 slots per second and GDU. The data in each slot consist of information regarding the beam firing directions (stored in the form of analytic gun deflection voltages), times-of-flight (if successfully measured), quality indicators, time stamps of the beam hits, and some auxiliary correlator-related information.
Whenever EDI is not in electron drift mode, it uses its ambient electron mode. This mode can sample at either 90 degrees pitch angle or at 0/180 degrees (field aligned), or alternate between 90 degrees and field aligned with selectable dwell times. While all options have been demonstrated during the commissioning phase, only the field-aligned mode has been used in the routine operations phase. The choices for energy are 250 eV, 500 eV, and 1 keV. The two detectors, which face opposite hemispheres, look strictly into opposite directions, so while one detector is looking along B the other is looking antiparallel to B (corresponding to pitch angles of 180 and 0 degrees, respectively). The two detectors switch roles every half spin of the spacecraft as the tip of the magnetic field vector spins outside the field of view of one detector and into the field of view of the other detector.
These data are a by-product generated from data collected in electric field mode. Whenever no return beam is found in a particular time slot by the flight software, the data to be reported will be flagged with the lowest quality level (quality zero). The ground processing generates a separate data product from these counts data. The EDI instrument paper can be found at: http://link.springer.com/article/10.1007%2Fs11214-015-0182-7. The EDI instrument data products guide can be found at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This commuter mode share data shows the estimated percentages of commuters in Champaign County who traveled to work using each of the following modes: drove alone in an automobile; carpooled; took public transportation; walked; biked; went by motorcycle, taxi, or other means; and worked at home. Commuter mode share data can illustrate the use of and demand for transit services and active transportation facilities, as well as for automobile-focused transportation projects.
Driving alone in an automobile is by far the most prevalent means of getting to work in Champaign County, accounting for over 69 percent of all work trips in 2023. This is the same rate as in 2019, and the first increase since 2017; both years were before the COVID-19 pandemic began.
The percentage of workers who commuted by all other means to a workplace outside the home also decreased from 2019 to 2021, with most of these modes reaching record lows since this data was first tracked in 2005. The percentage of people carpooling to work in 2023 was lower than in every year since 2005 except 2016. The percentage of people walking to work increased from 2022 to 2023, but this increase is not statistically significant.
Meanwhile, the percentage of people in Champaign County who worked at home more than quadrupled from 2019 to 2021, reaching a record high of over 18 percent. It is a safe assumption that this can be attributed to more employers allowing employees to work at home when the COVID-19 pandemic began in 2020.
The work-from-home figure decreased to 11.2 percent in 2023, which is the first statistically significant decrease since the pandemic began. However, this figure is still about 2.5 times higher than in 2019, even with the COVID-19 emergency ending in 2023.
Commuter mode share data was sourced from the U.S. Census Bureau’s American Community Survey (ACS) 1-Year Estimates, which are released annually.
As with any datasets that are estimates rather than exact counts, it is important to take into account the margins of error (listed in the column beside each figure) when drawing conclusions from the data.
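For example, the Census Bureau's standard test for whether two ACS estimates differ significantly (at the 90 percent confidence level used for ACS margins of error) checks the difference against the combined margin of error; a minimal sketch with hypothetical figures:

import math

def significant_change(est1: float, moe1: float,
                       est2: float, moe2: float) -> bool:
    # a difference is statistically significant when it exceeds the
    # root-sum-of-squares of the two margins of error
    return abs(est1 - est2) > math.sqrt(moe1 ** 2 + moe2 ** 2)

# hypothetical mode shares: 3.1% (+/- 0.9) in 2022 vs 3.8% (+/- 1.1) in 2023
print(significant_change(3.1, 0.9, 3.8, 1.1))  # False: 0.7 < ~1.42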
Due to the impact of the COVID-19 pandemic, instead of providing the standard 1-year data products, the Census Bureau released experimental estimates from the 1-year data in 2020. This includes a limited number of data tables for the nation, states, and the District of Columbia. The Census Bureau states that the 2020 ACS 1-year experimental tables use an experimental estimation methodology and should not be compared with other ACS data. For these reasons, and because data is not available for Champaign County, no data for 2020 is included in this Indicator.
For interested data users, the 2020 ACS 1-Year Experimental data release includes a dataset on Means of Transportation to Work.
Sources: U.S. Census Bureau; American Community Survey, 2023 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (18 September 2024).; U.S. Census Bureau; American Community Survey, 2022 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (10 October 2023).; U.S. Census Bureau; American Community Survey, 2021 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (14 October 2022).; U.S. Census Bureau; American Community Survey, 2019 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (26 March 2021).; U.S. Census Bureau; American Community Survey, 2018 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (26 March 2021).; U.S. Census Bureau; American Community Survey, 2017 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (13 September 2018).; U.S. Census Bureau; American Community Survey, 2016 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (14 September 2017).; U.S. Census Bureau; American Community Survey, 2015 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (19 September 2016).; U.S. Census Bureau; American Community Survey, 2014 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2013 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2012 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2011 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2010 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2009 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2008 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2007 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2006 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).; U.S. Census Bureau; American Community Survey, 2005 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a consolidated view of Official Utilisation figures across all transport modes (train, metro, bus, ferry and light rail). Opal daily tap-on/tap-off data is aggregated to a total monthly figure representing the estimated number of trips across all transport modes.
Starting July 1, 2024, the methodology for calculating trip numbers for individual lines and operators will change to more accurately reflect the services our passengers use within the transport network. This new approach will apply to trains, metros, light rail, and ferries, and will soon be extended to buses. Aggregations between line, agency, and mode levels will no longer be valid, as a passenger may use multiple lines on a single trip. Trip numbers at the line, operator, or mode level should be used as reported, without further combinations.
The dataset includes reports based on both the new and old methodologies, with a transition to the new method taking place over the coming months. As a result of this change, caution should be exercised when analysing longer trends that utilise both datasets. More information on NRT ROAM can be accessed here.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation being performed by Geant4 with the CLIC-like detector setup CLICdet (CLIC_o3_v14) setup. Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.
In this version of the dataset no γγ -> hadrons background is included.
This dataset contains e+e- samples with Z->ττ, ZH,H->ττ and Z->qq events, with approximately 2 million events simulated in each category.
The following processes e+e- were simulated with Pythia 8 at sqrt(s) = 380 GeV:
The .root files from the MC simulation chain are eventually processed by the software found in Github in order to create flat ntuples as the final product.
The basis of the ntuples are the particle flow (PF) candidates from PandoraPF. Each PF candidate has four momenta, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to be 0 GeV for both generator level jets and reconstructed jets. The dataset contains the four momenta of the jets, with the PF candidates in the jets with the above listed properties.
Additionally, a set of variables describing the tau lifetime are calculated using the software in Github. As tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.
In summary, the features found in the flat ntuples are:
Name | Description |
reco_cand_p4s | 4-momenta per particle in the reco jet. |
reco_cand_charge | Charge per particle in the jet. |
reco_cand_pdg | PDGid per particle in the jet. |
reco_jet_p4s | RecoJet 4-momenta. |
reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR<0.4, a fill value is used. |
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:
File | # Jets | Size |
z_test.parquet | 870 843 | 171 MB |
z_train.parquet | 3 483 369 | 681 MB |
zh_test.parquet | 1 068 606 | 213 MB |
zh_train.parquet | 4 274 423 | 851 MB |
qq_test.parquet | 6 366 715 | 1.4 GB |
qq_train.parquet | 25 466 858 | 5.6 GB |
The dataset consists of six files, 8.9 GB in total.
The .parquet files can be directly loaded with the Awkward Array Python library.
An example of how one might use the dataset and its features is given in data_intro.ipynb
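For instance, a minimal sketch of loading one of the files from the table above with Awkward Array (assuming the file has been downloaded to the working directory):

import awkward as ak

jets = ak.from_parquet("z_test.parquet")

print(jets.fields)                   # feature names as listed above
print(jets["reco_jet_p4s"][:3])      # 4-momenta of the first three jets
print(jets["reco_cand_charge"][:3])  # per-particle charges in those jets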