By VISHWANATH SESHAGIRI [source]
This dataset contains YouTube video and channel metadata for analyzing the statistical relationships between videos and forming a topic tree. With 9 direct features and 13 derived features, it includes information such as total views per unit time, channel views, likes/subscribers ratio, comments/views ratio, and dislikes/subscribers ratio. The data provides a unique opportunity to gain insights on topics such as subscriber count trends over time or the impact of trends on subscriber engagement. Powerful models can be developed to show how different types of content drive viewership and to identify the most popular styles or topics within YouTube's vast catalogue. The data also offers an intriguing look into consumer behaviour: by analyzing ratios such as likes per subscriber and dislikes per view, one can explore what drives people to watch specific videos at certain times, or to favour certain channels over others. Finally, this dataset is completely open source with an easy-to-understand GitHub repo, making it a valuable resource for anyone looking to gain better insights into how their audience interacts with their content and how they might improve it in the future.
How to Use This Dataset
In general, it is important to understand each parameter in the dataset before proceeding with analysis. The parameters included are: totalviews/channelelapsedtime, channelViewCount, likes/subscriber, views/subscribers, subscriberCount, dislikes/views, comments/subscriber, channelCommentCount, likes/dislikes, comments/views, dislikes/subscribers, totviews/totsubs, and views/elapsedtime.
To use this dataset for your own analysis: 1) review each parameter's meaning and purpose in the dataset; 2) get familiar with basic descriptive statistics such as mean, median, mode, and range; 3) create visualizations or tables based on subsets of the data; 4) examine correlations between different sets of variables or parameters; 5) draw meaningful conclusions about specific channels or topics based on organized graph hierarchies or tables; 6) analyze trends over time for individual parameters as well as the aggregate reaction from all users when videos are released.
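As a starting point, here is a minimal pandas sketch of steps 2 and 4 (the file name comes from the file listing below; the column names are assumed to match the data dictionary):

import pandas as pd

# Load the dataset (file name as given in the file listing below).
df = pd.read_csv("YouTubeDataset_withChannelElapsed.csv")

# Step 2: basic descriptive statistics for every numeric column.
print(df.describe())

# Step 4: correlations between a few engagement parameters; the column
# names are assumed to match the data dictionary (e.g. "likes/subscriber").
cols = ["totalviews/channelelapsedtime", "channelViewCount", "likes/subscriber"]
print(df[cols].corr())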
Predicting the Relative Popularity of Videos: This dataset can be used to build a statistical model that can predict the relative popularity of videos based on various factors such as total views, channel viewers, likes/dislikes ratio, and comments/views ratio. This model could then be used to make recommendations and predict which videos are likely to become popular or go viral.
Creating Topic Trees: The dataset can also be used to create topic trees or taxonomies by analyzing the content of videos and looking at what topics they cover. For example, one could analyze the most popular YouTube channels in a specific subject area, group together those that discuss similar topics, and then build an organized tree structure around those topics in order to better understand viewer interests in that area.
Viewer Engagement Analysis: This dataset could also be used for viewer engagement analysis by examining factors such as subscriber count, average time spent watching a video per user (elapsed time), and comments made per view, so as to gain insights into how engaged viewers are with specific content or channels on YouTube. From this information it would be possible to optimize content strategy accordingly in order to improve overall engagement rates across various types of video content and channel types.
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: YouTubeDataset_withChannelElapsed.csv

| Column name | Description |
|:---|:---|
| totalviews/channelelapsedtime | Ratio of total views to channel elapsed time. (Ratio) |
| channelViewCount | Total number of views for the channel. (Integer) |
| likes/subscriber | ... |
Data Set Overview =================
Version 1.1 ===========
This Version 1.1 Data Set replaces the Version 1.0 Data Set (DATA_SET_ID = VG2-J-PLS-5-ION-MOM-96.0SEC) previously archived with the PDS. The old Version of the Data Set was given a new DATA_SET_ID and intentionally marked as V1.0.
Data Set Description ====================
This Data Set contains the best Estimates of the total Ion Density from Voyager 2 at Jupiter in the PLS Voltage Range (10 eV/Q to 5950 eV/Q). It is calculated using the Method of (McNutt et al., 1981), which to first order consists of taking the total measured Current and dividing by the Collector Area and Plasma Bulk Velocity. This Method is only accurate for high Mach Number Flows directly into the Detector, and may result in Underestimates of the total Density by a Factor of 2 in the outer Magnetosphere. Thus absolute Densities should be treated with caution, but Density Variations in the Data Set can be trusted. The low resolution Mode Density is used at all times except Day 190 at 2100-2200, when the Larger of the high and low resolution Mode Densities in a 96 s Period is used. Corotation is assumed inside L=17.5, and a constant Velocity Component of 200 km/s into the D Cup is used outside of this. These are the Densities given in the (McNutt et al., 1981) Paper corrected by a Factor of 1.209 (0.9617) for Densities obtained from the Side (Main) Sensor. This Correction is due to a better Calculation of the effective Area of the Sensors. Data Format: Column 1 is Time (yyyy-mm-ddThh:mm:ss.sssZ), Column 2 is the Moment Density in cm^-3. Each Row has Format (a24,1x,1pe9.2). Values of -9.99e+10 indicate that the Parameter could not be obtained from the Data using the standard Analysis Technique. Additional Information about this Data Set and the Instrument which produced it can be found elsewhere in this Catalog. An Overview of the Data in this Data Set can be found in (McNutt et al., 1981) and a complete Instrument Description can be found in (Bridge et al., 1977).
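For illustration, a minimal Python sketch for reading rows in the stated (a24,1x,1pe9.2) format; the file name is hypothetical:

import numpy as np

# Each row: a 24-character UTC time stamp (a24), one space (1x),
# then the moment density in cm^-3 in scientific notation (1pe9.2).
rows = []
with open("ion_density_96s.tab") as fh:  # hypothetical file name
    for line in fh:
        time_str = line[0:24]
        density = float(line[25:34])
        rows.append((time_str, density))

# Values of -9.99e+10 flag densities the standard analysis could not obtain.
densities = np.array([d for _, d in rows])
valid = densities[densities != -9.99e10]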
+---------------------------+
| Processing Level ID | 5   |
| Software Flag       | Y   |
+---------------------------+
Parameters ==========
Ion Density ===========
+-------------------------------------------------+
| Sampling Parameter Name        | TIME           |
| Data Set Parameter Name        | ION DENSITY    |
| Sampling Parameter Resolution  | 96.000000      |
| Sampling Parameter Interval    | 96.000000      |
| Minimum Available Sampling Int | 96.000000      |
| Data Set Parameter Unit        | CM^-3          |
| Sampling Parameter Unit        | SECOND         |
+-------------------------------------------------+
A derived Parameter equaling the Number of Ions per Unit Volume over a specified Range of Ion Energy, Energy per Charge, or Energy per Nucleon. Discrimination with regard to Mass and/or Charge State is necessary to obtain this Quantity; however, Mass and Charge State are often assumed due to Instrument Limitations.
Many different Forms of Ion Density are derived. Some are distinguished by their Composition (N+, Proton, Ion, etc.) or their Method of Derivation (Maxwellian Fit, Method of Moments). In some cases, more than one Type of Density will be provided in a single Data Set. In general, if more than one Ion Species is analyzed, either by Moment or Fit, a total Density will be provided which is the Sum of the Ion Densities. If a Plasma Component does not have a Maxwellian Distribution the actual Distribution can be represented as the Sum of several Maxwellians, in which case the Density of each Maxwellian is given.
Source Instrument Parameters ============================
+----------------------------------------------------------------+
| Instrument Host ID              | VG2                          |
| Data Set Parameter Name         | ION DENSITY                  |
| Instrument Parameter Name       | ION RATE                     |
|                                 | ION CURRENT                  |
|                                 | MULTIPLE PARTICLE PARAMETERS |
| Important Instrument Parameters | 1 (for all Parameters)       |
+----------------------------------------------------------------+
Processing ==========
Processing History ==================
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We are excited to release a high-quality mathematical reasoning dataset designed to improve large language models on complex problem-solving tasks. This dataset was developed specifically for the AIMO competition and focuses on enhancing reasoning reliability through executable Python tool calls.
Dataset Name: AIMO3-High-Difficulty-Tool-Calling-Dataset.jsonl
Our dataset is built on Nemotron-Math-v2 and is further refined through additional filtering and processing, as outlined below:
Problem Scope Filtering (Answer Constraints): we keep only problems whose reference answers are integers within the range [0, 99999].
Difficulty Filtering (Model Pass-Rate Constraint): we keep only problems for which gpt-oss-120b, under high reasoning mode, achieves ≤ 7 correct answers out of 8 sampling attempts.
Answer Regeneration (Method Inspired by a Public Notebook): we follow the approach used in the current top-ranked public AIMO3 notebook (https://www.kaggle.com/code/nihilisticneuralnet/44-50-aimo3-skills-optional-luck-required) and regenerate model answers for the retained problems.
Ongoing Expansion (More Samples for Harder Filtering): we are increasing the number of samples from 8 to 16 in order to identify and retain even more challenging problems.
Intended Use Cases: this dataset can be used for curriculum learning, supervised fine-tuning (SFT), and reinforcement learning (RL).
Each sample follows a structured JSON schema:
{
"metadata_infos": {
"problem": "Original problem",
"standard": "Reference answer",
"data_source": "Source (e.g., AoPS competition datasets)",
"reason_high_with_tool": {
"count": "Number of samplings (8)",
"pass": "Number of correct responses",
"accuracy": "Accuracy rate"
}
},
"text": "Harmony-formatted training text"
}
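As an illustration, a minimal sketch that streams the JSONL release and keeps only the hardest problems, assuming the schema above (the pass field is read as an integer; the threshold of 2 is arbitrary):

import json

hard = []
with open("AIMO3-High-Difficulty-Tool-Calling-Dataset.jsonl") as fh:
    for line in fh:
        sample = json.loads(line)
        stats = sample["metadata_infos"]["reason_high_with_tool"]
        # Keep problems solved in at most 2 of the 8 sampling attempts.
        if int(stats["pass"]) <= 2:
            hard.append(sample["text"])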
The dataset is released in Harmony format. Model responses can be generated directly by running:
sh collect.sh
allowing immediate use with GPT-OSS models without additional preprocessing.
from transformers import AutoTokenizer

# Load the GPT-OSS tokenizer and tokenize the Harmony-formatted training text.
tokenizer = AutoTokenizer.from_pretrained(
    "openai/gpt-oss-120b",
    trust_remote_code=True,
)
input_ids = tokenizer(data["text"])["input_ids"]
| Pass Count | Samples |
|---|---|
| 1 | 662 |
| 2 | 1050 |
| 3 | 1794 |
| 4 | 612 |
| 5 | 752 |
| 6 | 950 |
| 7 | 1473 |
| Total | 7293 |
The final dataset contains approximately 70,000 reasoning trajectories.
The dataset was built through a multi-stage pipeline, applying different retention strategies based on pass rates. This approach balances:

✅ reasoning diversity
✅ execution reliability
✅ problem difficulty
| Reasoning | Acc |
|---|---|
| reason_low_no_tool | 0.304077 |
| reason_low_with_tool | 0.470305 |
| reason_medium_no_tool | 0.579227 |
| reason_medium_with_tool | 0.675028 |
| reason_high_no_tool | 0.757720 |
| reason_high_with_tool | 0.794518 |
Based on the statistics presented in the table above (the data in the table is derived from the Nemotron-Math-v2 dataset), our analysis reveals a key insight:
Integrating Python tool calls into reasoning significantly improves mathematical accuracy in large language models.
Therefore, this dataset emphasizes executable reasoning rather than purely textual chains of thought.
Although sourced from public datasets, extensive curation was performed:
Value Range Filtering: ensures alignment with the AIMO3 distribution.
Pass Rate Filtering: only problems with ≤ 7 successful samples (out of 8) were retained to maintain high difficulty.
Trajectory Filtering: for relatively easier problems, only trajectories...
This dataset details service data and operating expenses, broken out by agency and mode, reported to the National Transit Database for the 2024 report year.
NTD Data Tables organize and summarize data from the 2024 National Transit Database in a manner that is more useful for quick reference and summary analysis.
If you have any other questions about this table, please contact the NTD Help Desk at NTDHelp@dot.gov.
Data Set Overview =================
This Data Set gives the best available Values for Ion Densities, Temperatures, and Velocities near Neptune derived from Data obtained by the Voyager 2 Plasma Experiment. All Parameters are obtained by fitting the observed Spectra (Current as a Function of Energy) with Maxwellian Plasma Distributions, using a non-linear least squares fitting Routine to find the Plasma Parameters which, when coupled with the full Instrument Response, best simulate the Data. The PLS Instrument measures Energy per Charge, so Composition is not uniquely determined but can be deduced in some cases by the separation of the observed Current Peaks in Energy (assuming the Plasma is co-moving). In the upstream Solar Wind, Protons are fit to the M-long Data since high Energy Resolution is needed to obtain accurate Plasma Parameters. In the Magnetosheath the Ion Flux is so low that several L-long Spectra (3-5) had to be averaged to increase the signal-to-noise Ratio to a Level at which the Data could be reliably fit. These averaged Spectra were fit using two Proton Maxwellians with the same Velocity. The Values given in the upstream Magnetosheath are the total Density and the density-weighted Temperature. In both the upstream Solar Wind and Magnetosheath full vector Velocities, Densities and Temperatures are derived for each Fit Component. In the Magnetosphere, Spectra do not contain enough Information to obtain full Velocity Vectors, so Flow is assumed to be purely azimuthal. In some cases the azimuthal Velocity is a Fit Parameter, in some cases rigid Corotation is assumed. In the "outer" Magnetosphere (L>5) two distinct Current Peaks appear in the Spectra; these are fit assuming a Composition of H+ and N+. In the inner Magnetosphere the Plasma is hot and the Composition is ambiguous, although two superimposed Maxwellians are still required to fit the Data. These Spectra are fit using two Compositions, one with H+ and N+ and the second with two H+ Components. The N+ Composition is preferred by the Data Provider. All Fit Values in the Magnetosphere come with one Sigma Errors. It should be noted that no attempt has been made to account for the Spacecraft Potential, which is probably about -10 V in this Region and will affect the Density and Velocity Values. In the outbound Magnetosheath and Solar Wind both Moment and Fit Values are given for the Velocity, Density, and Thermal Speed. The signal-to-noise Ratio in the M-longs is very low, especially near the Magnetopause, which can result in the Analysis giving incorrect Values. The L-long Spectra have too low an Energy Resolution to permit accurate Determination of Parameters in many Regions; in particular the Temperature and non-radial Velocity Components may be inaccurate.
Parameters ==========
Derived Parameters ==================
+-------------------------------------------------------+
| Sampling Parameter Name             | TIME            |
| Sampling Parameter Resolution       | N/A             |
| Minimum Sampling Parameter          | UNK             |
| Maximum Sampling Parameter          | UNK             |
| Sampling Parameter Interval         | UNK             |
| Minimum Available Sampling Interval | UNK             |
| Data Set Parameter Name             | ION DENSITY     |
| Noise Level                         | UNK             |
| Data Set Parameter Unit             | cm^-3           |
+-------------------------------------------------------+
Ion Density: A derived Parameter equaling the Number of Ions per Unit Volume over a specified Range of Ion Energy, Energy per Charge, or Energy per Nucleon. Discrimination with regard to Mass and/or Charge State is necessary to obtain this Quantity; however, Mass and Charge State are often assumed due to Instrument Limitations.
Many different Forms of Ion Density are derived. Some are distinguished by their Composition (N+, Proton, Ion, etc.) or their Method of Derivation (Maxwellian Fit, Method of Moments). In some cases, more than one Type of Density will be provided in a single Data Set. In general, if more than one Ion Species is analyzed, either by Moment or Fit, a total Density will be provided which is the Sum of the Ion Densities. If a Plasma Component does not have a Maxwellian Distribution the actual Distribution can be represented as the Sum of several Maxwellians, in which case the Density of each Maxwellian is given.
+-------------------------------------------------------+
| Sampling Parameter Name             | TIME            |
| Sampling Parameter Resolution       | N/A             |
| Minimum Sampling Parameter          | UNK             |
| Maximum Sampling Parameter          | UNK             |
| Sampling Parameter Interval         | UNK             |
| Minimum Available Sampling Interval | UNK             |
| Data Set Parameter Name             | ION TEMPERATURE |
| Noise Level                         | UNK             |
| Data Set Parameter Unit             | EV              |
+-------------------------------------------------------+
Ion Temperature: A derived Parameter giving an Indication of the Mean Energy per Ion, assuming the Shape of the Ion Energy Spectrum to be Maxwellian (i.e. highest entropy shape). Given that the Ion Energy Spectrum is not exactly Maxwellian, the Ion Temperature can be defined integrally (whereby the Mean Energy obtained by integrating under the actual Ion Energy Spectrum is set equal to the Integral under a Maxwellian, where the Temperature is a free
General

For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
- 30 completely labeled (segmented) images
- 71 partly labeled images
- altogether comprising 600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
- To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
- A set of metrics and a novel ranking score for meaningful method benchmarking
- An evaluation of three baseline methods in terms of the above metrics and score
Abstract

Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

Dataset documentation

We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire: FISBe Datasheet. Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files

Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

We recommend working in a virtual environment, e.g., by using conda:

conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env

How to open zarr files
Install the python zarr package: pip install zarr
Open a zarr file with:

import zarr

raw = zarr.open(path_to_zarr, mode='r', path="volumes/raw")
seg = zarr.open(path_to_zarr, mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.

How to view zarr image files

We recommend using napari to view the image data.
Install napari: pip install "napari[all]"
Save the following Python script (e.g., as view_data.py):

import zarr, sys, napari

raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute: python view_data.py path-to-file/R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
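As a minimal illustration of the aggregate score defined above (see the paper for the formal definitions of avF1 and C; this sketch is not the benchmark's own code):

def ranking_score(avF1: float, C: float) -> float:
    # S: average of the average F1 score and the average ground truth coverage.
    return 0.5 * (avF1 + C)

print(ranking_score(0.60, 0.70))  # 0.65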
For more information on our selected metrics and formal definitions please see our paper.

Baseline

To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

License

The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

If you use FISBe in your research, please use the following BibTeX entry:

@misc{mais2024fisbe,
  title         = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author        = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year          = 2024,
  eprint        = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

Acknowledgments

We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

Changelog

There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

Contributing

If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository. All contributions are welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The do-file marital_spouselinks.do combines all data on people's marital statuses and reported spouses to create the following datasets:

1. all_marital_reports - a listing of all the times an individual has reported their current marital status, with the id numbers of the reported spouse(s); this listing is as reported, so it may include discrepancies (i.e. a 'Never married' status following a 'Married' one)
2. all_spouse_pairs_full - a listing of each time each spouse pair has been reported, plus summary information on co-residency for each pair
3. all_spouse_pairs_clean_summarised - summarises the data from all_spouse_pairs_full to give start and end dates of unions
4. marital_status_episodes - combines data from all the sources to create episodes of marital status; each episode has a start date, an end date and a marital status, and, if currently married, the spouse ids of the current spouse(s) if reported. There are several variables to indicate where each piece of information comes from.
The first 2 datasets are made available in case people need the 'raw' data for any reason (i.e. if they only want data from one study) or if they wish to summarise the data in a different way to what is done for the last 2 datasets.
The do-file is quite complicated, with many sources of data going through multiple processes to create variables in the datasets, so it is not always straightforward to explain in the documentation where each variable comes from. The 4 datasets build on each other and the do-file is documented throughout, so anyone wanting to understand the process in great detail may be better off examining the do-file itself. However, below is a brief description of how the datasets are created:
Marital status data are stored in the tables of the study they were collected in:

AHS Adult Health Study [ahs_ahs1]
CEN Census (initial CRS census) [cen_individ]
CENM In-migration (CRS migration form) [crs_cenm]
GP General form (filled for various reasons) [gp_gpform]
SEI Socio-economic individual (annual survey from 2007 onwards) [css_sei]
TBH TB household (study of household contacts of TB patients) [tb_tbh]
TBO TB controls (matched controls for TB patients) [tb_tbo & tb_tboto2007]
TBX TB cases (TB patients) [tb_tbx & tb_tbxto2007]

In many of the above surveys, as well as their current marital status, people were asked to report their current and past spouses along with (sometimes) some information about the marriage (start/end year etc.). These data are stored all together in the table gen_spouse, with variables indicating which study the data came from. Further evidence of spousal relationships is taken from gen_identity (if a couple appear as co-parents to a CRS member) and from crs_residency_episodes_clean_poly, a combined dataset (if they are living in the same household at the same time). Note that co-parent couples who are not reported in gen_spouse are only retained in the datasets if they have co-resident episodes.
The marital status data are appended together and the spouse id data merged in. Minimal data editing/cleaning is carried out. As the spouse data are in long format, this dataset is reshaped wide to have one line per marital status report (polygamy in the area allows for men to have multiple spouses at one time): this dataset is saved as all_marital_reports.
The list of reported spouses on gen_spouse is appended to a list of co-parents (from gen_identity) and this list is cleaned to try to identify and remove obvious id errors (incestuous links, same sex [these are not reported in this culture] and large age difference). Data reported by men and women are compared and variables created to show whether one or both of the couple report the union. Many records have information on start and end year of marriage, and all have the date the union was reported. This listing is compared to data from residency episodes to add dates that couples were living together (not all have start/end dates so this is to try to supplement this), in addition the dates that each member of the couple was last known to be alive or first known to be dead are added (from the residency data as well). This dataset with all the records available for each spouse pair is saved as all_spouse_pairs_full.
The date data from all_spouse_pairs_full are then summarised to get one line per couple with earliest and latest known married date for all, and, if available, marriage and separation date. For each date there are also variables created to indicate the source of the data.
As the culture only allows women to have one spouse at a time, records for women with 'overlapping' husbands are cleaned. This dataset is then saved as all_spouse_pairs_clean_summarised.
Both the cleaned spouse pairs and the cleaned marital status datasets are converted into episodes: the spouse listing uses the marriage date or first known married date as the beginning, and the last known married date plus a year or the separation date as the end; the marital status records are collapsed into periods of the same status being reported (following some cleaning to remove impossible reports), with the start date being the first of these reports and the end date being the last of the reports plus a year. These episodes are appended together and a series of processes is run several times to remove overlapping episodes. To be able to assign specific spouse ids to each married episode, some episodes need to be 'split' into more than one (i.e. if a man is married to one woman from 2005 to 2017 and then marries another woman in 2008 and remains married to her till 2017, his initial married episode would be from 2005 to 2017, but this would need to be split into one from 2005 to 2008, which would have 1 idspouse attached, and another from 2008 to 2017, which would have 2 idspouse attached). After this splitting process the spouse ids are merged in.
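To illustrate the splitting step with the example from the text, here is a hedged Python sketch (the do-file itself is Stata; the names and data structures below are hypothetical):

from typing import Dict, List, Tuple

def split_episodes(spouses: Dict[str, Tuple[int, int]]) -> List[Tuple[int, int, List[str]]]:
    # Cut the timeline at every year in which the set of current spouses changes.
    cuts = sorted({y for start, end in spouses.values() for y in (start, end)})
    episodes = []
    for start, end in zip(cuts, cuts[1:]):
        current = [sid for sid, (s, e) in spouses.items() if s <= start and e >= end]
        if current:
            episodes.append((start, end, current))
    return episodes

# Wife A reported 2005-2017, wife B reported 2008-2017:
print(split_episodes({"A": (2005, 2017), "B": (2008, 2017)}))
# [(2005, 2008, ['A']), (2008, 2017, ['A', 'B'])]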
The final episode dataset is saved as marital_status_episodes.
Individual
Face-to-face [f2f]
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of total communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview (3,132 × 15 = 46,980). The large-module households were used to select the households for the official VHFPS interview, with the small-module households held in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:

Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, as well as supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ note and make adjustment accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down respondents’ answer in detail so that the survey management team will justify and make a decision which code is best suitable for such answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the data accuracy of outlier values, defined as values that lie below the 5th or above the 95th percentile, by listening to interview recordings (a sketch of this check appears after this list).
• Final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
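A minimal pandas sketch of the percentile-based outlier flag described in the cleaning steps above (the function and column names are illustrative, not the survey team's own code):

import pandas as pd

def flag_outliers(series: pd.Series) -> pd.Series:
    # Mark values below the 5th or above the 95th percentile for review.
    lo, hi = series.quantile(0.05), series.quantile(0.95)
    return (series < lo) | (series > hi)

# Example with a hypothetical numeric variable:
# df["review_income"] = flag_outliers(df["income"])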
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This training dataset was calculated using the mechanistic modeling approach. See the “Benchmark Synthetic Training Data for Artificial Intelligence-based Li-ion Diagnosis and Prognosis“ publication for more details. More details will be added when published. The prognosis dataset was harder to define as there are no limits on how the three degradation modes can evolve. For this proof-of-concept work, we considered eight parameters to scan. For each degradation mode, degradation was chosen to follow equation (1).
%degradation = a × cycle + (exp(b × cycle) − 1)    (1)
Considering the three degradation modes, this accounts for six parameters to scan. In addition, two other parameters were added, a delay for the exponential factor for LLI, and a parameter for the reversibility of lithium plating. The delay was introduced to reflect degradation paths where plating cannot be explained by an increase of LAMs or resistance [55]. The chosen parameters and their values are summarized in Table S1 and their evolution is represented in Figure S1. Figure S1(a,b) presents the evolution of parameters p1 to p7. At the worst, the cells endured 100% of one of the degradation modes in around 1,500 cycles. Minimal LLI was chosen to be 20% after 3,000 cycles. This is to guarantee at least 20% capacity loss for all the simulations. For the LAMs, conditions were less restrictive, and, after 3,000 cycles, the lowest degradation is of 3%. The reversibility factor p8 was calculated with equation (2) when LAMNE > PT.
%LLI = %LLI + p8 × (LAMPE − PT)    (2)
Where PT was calculated with equation (3) from [60].
PT = 100 − ((100 − LAMPE) / (100 × LRini − LAMPE)) × (100 − OFSini − LLI)    (3)
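A short Python sketch of equations (1)-(3) as written above; the variable names follow the text, and this is an illustration rather than the authors' code:

import numpy as np

def degradation(cycle, a, b):
    # Equation (1): %degradation = a * cycle + (exp(b * cycle) - 1)
    return a * cycle + (np.exp(b * cycle) - 1.0)

def plating_threshold(LAM_PE, LR_ini, OFS_ini, LLI):
    # Equation (3): PT = 100 - ((100 - LAMPE) / (100 * LRini - LAMPE)) * (100 - OFSini - LLI)
    return 100.0 - ((100.0 - LAM_PE) / (100.0 * LR_ini - LAM_PE)) * (100.0 - OFS_ini - LLI)

def lli_with_plating(LLI, p8, LAM_PE, PT):
    # Equation (2), applied when LAMNE > PT: %LLI = %LLI + p8 * (LAMPE - PT)
    return LLI + p8 * (LAM_PE - PT)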
Varying all those parameters accounted for more than 130,000 individual duty cycles, with one voltage curve for every 100 cycles. Six MATLAB© .mat files are included. The GIC-LFP_duty_other.mat file contains 12 variables:

Qnorm: normalized capacity scale for all voltage curves
p1 to p8: values used to generate the duty cycles
key: index indicating which values were used for each degradation path (1 - p1, ..., 8 - p8)
QL: capacity loss, one line per path, one column per 100 cycles.
The file GIC-LFP_duty_LLI-LAMsvalues.mat contains the values of LLI, LAMPE and LAMNE for all cycles (1 line per 100 cycles) and duty cycles (columns).
Files GIC-LFP_duty_1 to _4 contain the voltage data split into 1 GB chunks (40,000 simulations). Each cell corresponds to one line in the key variable. Inside each cell, there is one column per 100 cycles.
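A hedged example of loading the described files with SciPy; the variable names follow the description above but their exact capitalization in the files may differ, and MATLAB v7.3 files would need h5py instead:

from scipy.io import loadmat

data = loadmat("GIC-LFP_duty_other.mat")
qnorm = data["Qnorm"]  # normalized capacity scale for all voltage curves
key = data["key"]      # which parameter values were used for each path
ql = data["QL"]        # capacity loss: one line per path, one column per 100 cycles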
Search Coil Magnetometer (SCM) AC Magnetic Field (16384 samples/s), Level 2, High Speed Burst Mode Data. The tri-axial Search-Coil Magnetometer with its associated preamplifier measures three-dimensional magnetic field fluctuations. The analog magnetic waveforms measured by the SCM are digitized and processed inside the Digital Signal Processor (DSP), collected and stored by the Central Instrument Data Processor (CIDP) via the Fields Central Electronics Box (CEB). Prior to launch, all SCM Flight models were calibrated by LPP team members at the National Magnetic Observatory, Chambon-la-Foret (Orleans). Once per orbit, each SCM transfer function is checked thanks to the onboard calibration signal provided by the DSP. The SCM is operated for the entire MMS orbit in survey mode. Within scientific Regions Of Interest (ROI), burst mode data are also acquired, as well as high speed burst mode data. This SCM data set corresponds to the AC magnetic field waveforms in nanoTesla and in the GSE frame. The SCM instrument paper can be found at http://link.springer.com/article/10.1007/s11214-014-0096-9 and the SCM data product guide at https://lasp.colorado.edu/mms/sdc/public/datasets/fields/.
Data Set Overview =================
Version 1.1 ===========
This Version 1.1 Data Set replaces the Version 1.0 Data Set (DATA_SET_ID = VG1-J-PLS-5-ION-MOM-96.0SEC) previously archived with the PDS.
Data Set Description ====================
This Data Set contains the best Estimates of the total Ion Density at Jupiter during the Voyager 1 Encounter in the PLS Voltage Range (10 eV/Q to 5950 eV/Q). It is calculated using the Method of (McNutt et al., 1981), which to First Order consists of taking the Total measured Current and dividing by the Collector Area and Plasma Bulk Velocity. This Method is only accurate for high Mach Number Flows directly into the Detector, and may result in Underestimates of the total Density by a Factor of 2 in the outer Magnetosphere. Thus absolute Densities should be treated with Caution, but Density Variations in the Data Set can be trusted. The low resolution Mode Density is used before the Year 1979, Day 63 at 1300; after this the Larger of the high and low resolution Mode Densities in a 96 s Period is used, since the L-mode Spectra often are saturated. Corotation is assumed inside L=17.5, and a constant Velocity Component of 200 km/s into the D Cup is used outside of this. These are the Densities given in (McNutt et al., 1981) corrected by a Factor of 1.209 (0.9617) for Densities obtained from the Side (Main) Sensor. This Correction is due to a better Calculation of the Effective Area of the Sensors. Data Format: Column 1 is Time (yyyy-mm-ddThh:mm:ss.sssZ), Column 2 is the Moment Density in cm^-3. Each Row has Format (a24, 1x, 1pe9.2). Values of -9.99e+10 indicate that the Parameter could not be obtained from the Data using the standard Analysis Technique. Additional Information about this Data Set and the Instrument which produced it can be found elsewhere in this Catalog. An Overview of the Data in this Data Set can be found in (McNutt et al., 1981) and a complete Instrument Description can be found in (Bridge et al., 1977).
+---------------------------+
| Processing Level ID | 5   |
| Software Flag       | Y   |
+---------------------------+
Parameters ==========
Ion Density ===========
+-------------------------------------------------+
| Sampling Parameter Name        | TIME           |
| Data Set Parameter Name        | ION DENSITY    |
| Sampling Parameter Resolution  | 96.000000      |
| Sampling Parameter Interval    | 96.000000      |
| Minimum Available Sampling Int | 96.000000      |
| Data Set Parameter Unit        | CM^-3          |
| Sampling Parameter Unit        | SECOND         |
+-------------------------------------------------+
A derived Parameter equaling the Number of Ions per Unit Volume over a specified Range of Ion Energy, Energy per Charge, or Energy per Nucleon. Discrimination with regard to Mass and/or Charge State is necessary to obtain this Quantity; however, Mass and Charge State are often assumed due to Instrument Limitations.
Many different Forms of Ion Density are derived. Some are distinguished by their Composition (N+, Proton, Ion, etc.) or their Method of Derivation (Maxwellian Fit, Method of Moments). In some cases, more than one Type of Density will be provided in a single Data Set. In general, if more than one Ion Species is analyzed, either by Moment or Fit, a total Density will be provided which is the Sum of the Ion Densities. If a Plasma Component does not have a Maxwellian Distribution the actual Distribution can be represented as the Sum of several Maxwellians, in which case the Density of each Maxwellian is given.
Source Instrument Parameters ============================
+----------------------------------------------------------------+
| Instrument Host ID              | VG1                          |
| Data Set Parameter Name         | ION DENSITY                  |
| Instrument Parameter Name       | ION RATE                     |
|                                 | ION CURRENT                  |
|                                 | MULTIPLE PARTICLE PARAMETERS |
| Important Instrument Parameters | 1 (for all Parameters)       |
+----------------------------------------------------------------+
Processing ==========
Processing History ==================
+-----------------------------------------------------+
| Source Data Set ID  | VG1-PLS                     |
| Software            | MOMANAL                     |
| Product Data Set ID | VG1-J-PLS-5-ION-MOM-96.0SEC |
+-----------------------------------------------------+
Software MOMANAL ================
Software
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This commuter mode share data shows the estimated percentages of commuters in Champaign County who traveled to work using each of the following modes: drove alone in an automobile; carpooled; took public transportation; walked; biked; went by motorcycle, taxi, or other means; and worked at home. Commuter mode share data can illustrate the use of and demand for transit services and active transportation facilities, as well as for automobile-focused transportation projects.
Driving alone in an automobile is by far the most prevalent means of getting to work in Champaign County, accounting for about 64 percent of all work trips in 2024. This is a statistically significant decrease since 2023, which was the first year that matched pre-COVID-19 pandemic levels of driving alone.
The percentage of workers who commuted by all other means to a workplace outside the home also decreased from 2019 to 2021, with most of these modes reaching a record low since this data first started being tracked in 2005. All of these modes except public transportation saw increases from 2023 to 2024, but the increases were not statistically significant.
Meanwhile, the percentage of people in Champaign County who worked at home more than quadrupled from 2019 to 2021, reaching a record high over 18 percent. It is a safe assumption that this can be attributed to the increase of employers allowing employees to work at home when the COVID-19 pandemic began in 2020.
The work-from-home figure decreased to 11.2 percent in 2023, the first statistically significant decrease since the pandemic began. However, the figure saw a statistically significant increase from 2023 to 2024, rising back to 15.1 percent in 2024. This figure is about 3.3 times higher than in 2019, despite the COVID-19 emergency ending in 2023.
Commuter mode share data was sourced from the U.S. Census Bureau’s American Community Survey (ACS) 1-Year Estimates, which are released annually.
As with any datasets that are estimates rather than exact counts, it is important to take into account the margins of error (listed in the column beside each figure) when drawing conclusions from the data.
Due to the impact of the COVID-19 pandemic, instead of providing the standard 1-year data products, the Census Bureau released experimental estimates from the 1-year data in 2020. This includes a limited number of data tables for the nation, states, and the District of Columbia. The Census Bureau states that the 2020 ACS 1-year experimental tables use an experimental estimation methodology and should not be compared with other ACS data. For these reasons, and because data is not available for Champaign County, no data for 2020 is included in this Indicator.
For interested data users, the 2020 ACS 1-Year Experimental data release includes a dataset on Means of Transportation to Work.
Sources:
U.S. Census Bureau; American Community Survey, 2024 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (19 November 2024).
U.S. Census Bureau; American Community Survey, 2023 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (18 September 2024).
U.S. Census Bureau; American Community Survey, 2022 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (10 October 2023).
U.S. Census Bureau; American Community Survey, 2021 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (14 October 2022).
U.S. Census Bureau; American Community Survey, 2019 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (26 March 2021).
U.S. Census Bureau; American Community Survey, 2018 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (26 March 2021).
U.S. Census Bureau; American Community Survey, 2017 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (13 September 2018).
U.S. Census Bureau; American Community Survey, 2016 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (14 September 2017).
U.S. Census Bureau; American Community Survey, 2015 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (19 September 2016).
U.S. Census Bureau; American Community Survey, 2014 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2013 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2012 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2011 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2010 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2009 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2008 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2007 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2006 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2005 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
The Advanced Land Observing Satellite-2 (ALOS-2), launched May 14, 2014, is a follow-on mission to ALOS. ALOS-2 features an imaging microwave radar, PALSAR-2 (Phased Array type L-band Synthetic Aperture Radar-2). With a wider range of observation modes, PALSAR-2 improves on its predecessor PALSAR (which flew onboard ALOS). PALSAR-2 observes in the L-band, specifically 1257.5 MHz, which is adjustable by ± 21 MHz. ScanSAR mode provides a spatial resolution of 60 m and 100 m for a swath of 490 km and 350 km respectively. Stripmap mode provides a resolution of 10 m, 6 m, and 3 m for a swath of 70 km, 70 km, and 50 km respectively. Spotlight mode provides a resolution of 1 m x 3 m for a swath of 25 km x 25 km. ALOS-2 also decreased the revisit time of its predecessor from 46 days to 14 days. ALOS-2 also has the ability to look either to the right or the left. This collection provides open access to ScanSAR mode data acquired by ALOS-2. Recently acquired data are not included but are added when released for open access by JAXA. Although the ScanSAR mode has been acquired globally, the collection currently contains only partial global coverage, but coverage is increasing as data become available. Products have been processed by JAXA as Level 1.1, such that range and single look azimuth compressed data is represented by complex I and Q channels to preserve the magnitude and phase information. The range coordinate is in slant range. The image is focused onto the zero Doppler direction, and an image file is generated per each scan for ScanSAR mode. In addition, the full aperture processing method was used to generate the products. With the full aperture method, range compression and one-look azimuth compression are performed for the data whose gaps between neighboring bursts in a sub-swath are filled with zeroes; this processing is performed for each scan and each polarization. Granules in this collection are distributed as a zip and are quite large, with the majority being 28GB (single polarization) or 56GB (dual polarization).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DFEI dataset
The full description can also be found in README.md.
The dataset was used in the paper “GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions”. The project describes a full event interpretation at the LHCb experiment, situated at the Large Hadron Collider at CERN, Geneva. An “event” consists of detector responses that were converted to tracks; each track represents a particle.
The aim of the algorithm is to make sense of the tracks and bundle together tracks coming from the same origin, as well as interpreting their decay hierarchy.
Generated events
The events in this dataset are based on simulation generated with PYTHIA8 and EvtGen, in which the particle-collision conditions expected for the LHC Run 3 are replicated as shown in the table.
| LHCb period | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons |
|---|---|---|---|---|
| Runs 3-4 (Upgrade I) | ∼ 5 | ∼ 150 | ≪ 1 | ∼ 1 |
Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described in the paper in the appendix “Simulation”. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.
Datasets
The datasets are divided in three categories
Training and testing
The file Dataset_InclusiveHb_Training.root contains the training dataset (40,000 events) and the test dataset (10,000 events) of inclusive decays.
Evaluation
The inclusive dataset Dataset_InclusiveHb_Evaluation.root contains the evaluation events (50,000).
Exclusive decays
In addition to this inclusive dataset, several other smaller samples (of few thousand events each) have also been generated, requiring that all the events in each sample contained a specific (exclusive) type of b-hadron decay. The specific modes have been chosen to be representative of the most common classes of decay topologies of physics interest for LHCb. These samples contain only events in which all the particles originating from each of the considered exclusive decays have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region.
The datasets contained are:
Dataset_Bd_DD.root
Dataset_Bd_Kpi.root
Dataset_Bd_Kstmumu.root
Dataset_Bs_Dspi.root
Dataset_Bs_Jpsiphi.root
Dataset_Bu_KKpi.root
Dataset_Lb_Lcpi.root
More information on them can be found in the paper.
Loading the data
The dataset is saved in the binary ROOT format with a key-array mapping. It can be loaded using the uproot Python library to convert it to a pandas DataFrame or similar.
An example snippet is given here:
import uproot

treename = "Relations"
with uproot.open('/path/to/file.root') as file:
    df = file[treename].arrays(
        # we can specify only a set of branches
        # ['EventNumber', 'FromSamePV_true'],
        library='pd')  # 'pd' for pandas
The returned file behaves like a mapping that contains two different data holders. They are accessible with Relations or Particles that contain either the relations between the particles or the particles themselves.
Regarding the Relations, only edges connecting two different particles are contained in the dataset. The edges are treated as not directional, so a single edge is considered for each pair of particles.
Variables
The relevant features used in the GNN are described in the following. A cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented. When specified in the name of the variables, the suffix “_true” refers to ground-truth information, and the suffix “_reco” refers to the output of the emulated LHCb reconstruction.
General:
EventNumber: unique number to identify the event that the entry belongs to.
Node variables:
ParticleKey: unique number to identify each particle in a given event.
Identity (ID): numerical code identifying the type of particle, following the Monte Carlo Particle Numbering Scheme.
FromPrimaryBeautyHadron: boolean variable indicating whether the particle has been produced in a beauty hadron decay or not.
Transverse momentum (pT): component of the three-momentum transverse to the beamline, i.e. the x and y component combined.
Impact parameter with respect to the associated primary vertex (IP): distance of closest approach between the particle trajectory and its associated primary vertex (proton-proton collision point), defined as the one with the smallest IP for the given particle amongst all the primary vertices in the event.
Pseudorapidity (η): spatial coordinate describing the angle of a particle relative to the beam axis, computed as η = arctanh(pz/∥p⃗∥).
Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.
Ox, Oy, Oz: cartesian coordinates of the origin point of the particle.
px, py, pz: cartesian coordinates of the three-momentum.
PVx, PVy, PVz: cartesian coordinates of the position of the associated primary vertex.
Edge variables:
FirstParticleKey: ParticleKey of one of the two particles connected by the edge.
SecondParticleKey: ParticleKey of the other particle, verifying FirstParticleKey > SecondParticleKey.
FromSamePrimaryBeautyHadron: boolean variable indicating whether the two particles originate from the same beauty hadron decay.
Opening angle (θ): angle between the three-momentum directions of the two particles.
Momentum-transverse distance (d ⊥ P⃗): distance between the origin point of the two particles defined on a plane which is transverse to the combined three momentum of the two particles.
Distance along the beam axis (Δz): difference between the z-coordinate of the origin points of the two particles.
FromSamePV: boolean variable indicating whether the two particles share the same associated primary vertex.
Order of the “topological” Lowest Common Ancestor (TopoLCAOrder): variable that can take the values 0, 1, 2 or 3, as explained in the paper.
Identity of the “topological” Lowest Common Ancestor (TopoLCAID): numerical code identifying the particle type of the ancestor, following the Monte Carlo Particle Numbering Scheme.
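As an illustration, two of the kinematic variables defined above can be computed from the momentum components with numpy (a sketch under the stated definitions, not the project's own code):

import numpy as np

def pseudorapidity(px, py, pz):
    # eta = arctanh(pz / |p|)
    p = np.sqrt(px**2 + py**2 + pz**2)
    return np.arctanh(pz / p)

def opening_angle(p1, p2):
    # Angle between the three-momentum directions of the two particles.
    cos_theta = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))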
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Modal Service data and Safety & Security (S&S) public transit time series data delineated by transit/agency/mode/year/month. Includes all Full Reporters--transit agencies operating modes with more than 30 vehicles in maximum service--to the National Transit Database (NTD). This dataset will be updated monthly.
The monthly ridership data is released one month after the month in which the service is provided. Records with null monthly service data reflect late reporting.
The S&S statistics provided include both Major and Non-Major Events where applicable. Events occurring in the past three months are excluded from the corresponding monthly ridership rows in this dataset while they undergo validation. This dataset is the only NTD publication in which all Major and Non-Major S&S data are presented without any adjustment for historical continuity.
The GHS is an annual household survey specifically designed to measure the living circumstances of South African households. The GHS collects data on education, employment, health, housing and household access to services.
The survey is representative at national level and at provincial level.
Households and individuals
The survey covered all de jure household members (usual residents) of households in the nine provinces of South Africa and residents in workers' hostels. The survey does not cover collective living quarters such as students' hostels, old age homes, hospitals, prisons and military barracks.
Sample survey data
A multi-stage, stratified random sample was drawn using probability-proportional-to-size principles. First level stratification was based on province and second-tier stratification on district council. The GHS 2009 represents the second year of a new master sample (the first year was GHS 2008) that will be used until 2010.
Face-to-face [f2f]
The GHS uses questionnaires as data collection instruments.
The questionnaire for the General Household Survey has undergone various changes since 2002. Significant changes were made to the GHS 2009 questionnaire and this should be borne in mind when comparing across different datasets. See GHS 2009 statistical release for a detailed report on important differences between the questionnaires.
In GHS 2009-2010:
The variable on care provision (Q129acre) in the GHS 2009 and 2010 should be used with caution. The question to collect the data (question 1.29a) asks:
"Does anyone in this household personally provide care for at least two hours per day to someone in the household who - owing to frailty, old age, disability, or ill-health cannot manage without help?"
Response codes (in the questionnaire, metadata, and dataset) are:
1 = No
2 = Yes, 2-19 hours per week
3 = Yes, 20-49 hours per week
4 = Yes, 50+ hours per week
5 = Do not know
There is an inconsistency between the question, which asks about hours per day, and the response options, which record hours per week. As a result, a respondent who provides care for one hour per day (7 hours per week) would presumably not answer this question, and someone providing care for 13 hours per week would likewise be excluded, even though they do provide substantial care.
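For analyses that use this variable anyway, the codes above can be labelled explicitly; a hypothetical sketch (the column name Q129acre is taken from the text, the file name is invented):

import pandas as pd

# note the day-vs-week mismatch discussed above; interpret with caution
care_labels = {
    1: 'No',
    2: 'Yes, 2-19 hours per week',
    3: 'Yes, 20-49 hours per week',
    4: 'Yes, 50+ hours per week',
    5: 'Do not know',
}
ghs = pd.read_csv('ghs2009.csv')  # hypothetical file name
ghs['care_provision'] = ghs['Q129acre'].map(care_labels)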
In GHS 2009-2015:
The variable on land size in the General Household Survey questionnaire for 2009-2015 should be used with caution. The data comes from questions on the households' agricultural activities in Section 8 of the GHS questionnaire: Household Livelihoods: Agricultural Activities. Question 8.8b asks:
“Approximately how big is the land that the household use for production? Estimate total area if more than one piece.” One of the response categories is worded as:
1 = Less than 500m2 (approximately one soccer field)
However, a soccer field is approximately 5000 m2, not 500 m2, so response category 1 is incorrect; the category should read 5000 m2. This response option is correct in GHS 2002-2008, and the error was flagged and corrected by Statistics SA in the GHS 2016.
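Users relabelling this variable for 2009-2015 can patch the label directly; a hypothetical sketch:

# corrected value label for the GHS 2009-2015 land-size question (8.8b)
land_size_labels = {
    1: 'Less than 5000m2 (approximately one soccer field)',  # corrected from '500m2'
    # ... remaining categories as documented in the GHS metadata ...
}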
PSP FIELDS Digital Fields Board (DFB) Single Ended Voltage data:
The DFB is the low frequency, less than 75 kHz, component of the FIELDS experiment on the Parker Solar Probe spacecraft, see reference [1] below. For a full description of the FIELDS experiment, see reference [2]. For a description of the DFB, see reference [3].
The DFB continuous waveform data consist of time series data from various FIELDS sensors. These data have been filtered by both analog and digital filters [3].
The Level 2 data products contained in this data file have been calibrated for:
1) DFB in-band gain
2) DFB analog filter gain and phase response
3) DFB digital filter phase response
4) The search coil preamplifier gain and phase response, when applicable
Calibrations for the DFB digital filter gain response have not been implemented, but the required convolution kernel is provided in this file. It was decided not to apply the digital filter gain response calibration to these Level 2 data because it can introduce non-physical power at high frequencies when the uncorrected signal is dominated by noise. This effect should be examined carefully when determining spectral slopes and features at the highest frequencies. Calibrations for the FIELDS voltage sensor preamplifiers have not been implemented, as the preamplifier response is flat and equal to one throughout the DFB frequency range. Corrections for plasma sheath impedance gain and antenna effective length have not been applied to the voltage sensor data; these corrections will be applied in the Level 3 DFB data products. Therefore, all voltage sensor quantities present in these Level 2 data products are expressed in units of Volts. Likewise, all magnetic field quantities present in these Level 2 data products are expressed in units of nanoteslas.
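Should users apply the digital filter gain correction themselves, a hypothetical sketch (array names invented) is:

import numpy as np

def apply_digital_filter_gain(waveform, kernel):
    # convolve a DFB waveform with the kernel provided in the file;
    # as noted above, this can introduce non-physical power at high
    # frequencies when the uncorrected signal is noise-dominated
    return np.convolve(waveform, kernel, mode='same')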
The Level 2 data products contained in this data file are in spacecraft coordinates (e.g. x, y, z) or in sensor coordinates (e.g. dV12, dV34 for voltage measurements and [u,v,w] for the search coil magnetometer).
The time resolution of the DFB continuous waveform data can vary by multiples of 2^N. During encounter when PSP is within 0.25 AU of the Sun, the DFB continuous waveform data cadence is typically 256 samples/NYsecond [2].
References:
1) Fox, N.J., Velli, M.C., Bale, S.D., et al., Space Sci. Rev. (2016) 204:7. https://doi.org/10.1007/s11214-015-0211-6
2) Bale, S.D., Goetz, K., Harvey, P.R., et al., Space Sci. Rev. (2016) 204:49. https://doi.org/10.1007/s11214-016-0244-5
3) Malaspina, D.M., Ergun, R.E., Bolton, M., et al., JGR Space Physics (2016), 121, 5088-5096. https://doi.org/10.1002/2016JA022344
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Accident Detection Model is built with YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3,200+ images annotated on Roboflow.
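A minimal inference sketch with the ultralytics API (the weights file name is hypothetical):

from ultralytics import YOLO

model = YOLO('accident_yolov8.pt')  # hypothetical trained weights

# source can be an image path, a video path, or 0 for a live camera
results = model.predict(source='dashcam_frame.jpg', conf=0.5)
for r in results:
    print(r.boxes)  # detected accident bounding boxes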
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
Distributed fiber optic sensing was an important part of the monitoring system for EGS Collab Experiment #2. A single loop of custom fiber package was grouted into the four monitoring boreholes that bracketed the experiment volume. This fiber package contained two multi-mode fibers and four single-mode fibers. These fibers were connected to an array of fiber optic interrogator units, each targeting a different measurement. The distributed temperature system (DTS) consisted of a Silixa XT-DTS unit connected to both ends of one of the two multi-mode fibers. This system measured absolute temperature along the entire length of the fiber for the duration of the experiment at a sampling rate of approximately 10 minutes. This dataset includes both raw data in XML format from the XT-DTS and a processed dataset in which only the sections of data pertaining to the boreholes have been extracted. We have also included a report that provides all of the relevant details necessary for users to process and interpret the data themselves. Please read this accompanying report; if questions remain after reading it, please do not hesitate to contact us. Happy processing.
By VISHWANATH SESHAGIRI [source]
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: YouTubeDataset_withChannelElapsed.csv

| Column name | Description |
|:---|:---|
| totalviews/channelelapsedtime | Ratio of total views to channel elapsed time. (Ratio) |
| channelViewCount | Total number of views for the channel. (Integer) |
| likes/subscriber | ... |
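A minimal sketch loading the file named above and summarising two of the documented columns:

import pandas as pd

yt = pd.read_csv('YouTubeDataset_withChannelElapsed.csv')
print(yt[['totalviews/channelelapsedtime', 'channelViewCount']].describe())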