This table presents income shares, thresholds, tax shares, and total counts of individual Canadian tax filers, with a focus on high-income individuals (the 95% income threshold, the 99% threshold, etc.). Income thresholds are based on national threshold values, regardless of the selected geography; for example, the number of Nova Scotians in the top 1% is calculated as the number of tax-filing Nova Scotians whose total income exceeded the 99% national income threshold. Several definitions of income are available in the table, namely market, total, and after-tax income, each with and without capital gains.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure for that week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: File format: R workspace file, “Simulated_Dataset.RData”.

Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
“CWVS_LMC.txt”: This code is delivered to the user as a .txt file containing R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the code in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered as a .txt file containing R statistical software code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running “CWVS_LMC.txt”:
• msm: Sampling from the truncated normal distribution
• mnormt: Sampling from the multivariate normal distribution
• BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
• plotrix: Plotting the posterior means and credible intervals

Instructions for Use / Reproducibility: What can be reproduced: the data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”

Below is the replication procedure for the attached data set, for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description and permissions: As noted above, these are simulated data without any identifying information or informative birth-level covariates, and the pollution exposures have been standardized week by week (median subtracted, divided by the IQR) with the medians and IQRs withheld, which further protects the identifiability of the spatial locations used in the analysis.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions table in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder covers up to 1 million versions, e.g. folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder covers up to 1,000 versions, e.g. 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will hold many fewer than 1,000 files due to private and interactive sessions.
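As a rough illustration of this layout, the snippet below maps a KernelVersions id to its two-level subdirectory. It is a sketch only: the exact folder-name formatting and the file extension of each version are assumptions to verify against the actual files.

def kernel_version_subdir(version_id: int) -> str:
    # folder 123 holds ids 123,000,000-123,999,999; folder 123/456 holds ids 123,456,000-123,456,999
    top = version_id // 1_000_000
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}"

print(kernel_version_subdir(123_456_789))  # -> "123/456"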
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket; you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
download the fixed-width file containing household, family, and person records
import by separating this file into three tables, then merge 'em together at the person-level
download the fixed-width file containing the person-level replicate weights
merge the rectangular person-level file with the replicate weights, then store it in a sql database
create a new variable - one - in the data table

2012 asec - analysis examples.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
perform a boatload of analysis examples

replicate census estimates - 2011.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
match the sas output shown in the png file below

2011 asec replicate weight sas output.png
statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
the census bureau's current population survey page
the bureau of labor statistics' current population survey page
the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011.
when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
analyze the survey of income and program participation (sipp) with r

if the census bureau's budget was gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp). it's giant. it's rich with variables. it's monthly. it follows households over three, four, now five year panels. the congressional budget office uses it for their health insurance simulation. analysts read that sipp has person-month files, get scurred, and retreat to inferior options. the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon. questions swing wild and free through the jungle canopy i mean core data dictionary. legend has it that there are still species of topical module variables that scientists like you have yet to analyze. ponce de león would've loved it here. ponce. what a name. what a guy.

the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, try their survey design tutorial.

since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, every respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month. to repeat: core wave files contain four records per person, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool?

this new github repository contains eight, you read me, eight scripts:

1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R
since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
properly handle the parentheses seen in a few of the sas importation scripts, because the SAScii package currently does not
create an rsqlite database, initiate a variant of the read.SAScii function that imports ascii data directly into a sql database (.db)
download each microdata file - weights, topical modules, everything - then read 'em into sql

2008 panel - full year analysis examples.R
define which waves and specific variables to pull into ram, based on the year chosen
loop through each of twelve months, constructing a single-year temporary table inside the database
read that twelve-month file into working memory, then save it for faster loading later if you like
read the main and replicate weights columns into working memory too, merge everything
construct a few annualized and demographic columns using all twelve months' worth of information
construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19)

2008 panel - point-in-time analysis examples.R
define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory
read the topical module and replicate weights files into working memory too, merge it like you mean it
construct a few new, exciting variables using both core and topical module questions
construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is.

2008 panel - median value of household assets.R
define which wave(s) and specific variables to pull into ram, based on the topical module chosen
read the topical module and replicate weights files into working memory too, merge once again
construct a replicate-weighted complex sample design with a...
https://qdr.syr.edu/policies/qdr-restricted-access-conditions
Project Summary: This dataset contains all qualitative and quantitative data collected in the first phase of the Pandemic Journaling Project (PJP). PJP is a combined journaling platform and interdisciplinary, mixed-methods research study developed by two anthropologists, with support from a team of colleagues and students across the social sciences, humanities, and health fields. PJP launched in Spring 2020 as the COVID-19 pandemic was emerging in the United States. PJP was created in order to “pre-design an archive” of COVID-19 narratives and experiences open to anyone around the world. The project is rooted in a commitment to democratizing knowledge production, in the spirit of “archival activism” and using methods of “grassroots collaborative ethnography” (Willen et al. 2022; Wurtz et al. 2022; Zhang et al. 2020; see also Carney 2021). The motto on the PJP website encapsulates these commitments: “Usually, history is written only by the powerful. When the history of COVID-19 is written, let’s make sure that doesn’t happen.” (A version of this Project Summary with links to the PJP website and other relevant sites is included in the public documentation of the project at QDR.) In PJP’s first phase (PJP-1), the project provided a digital space where participants could create weekly journals of their COVID-19 experiences using a smartphone or computer. The platform was designed to be accessible to as wide a range of potential participants as possible. Anyone aged 15 or older, living anywhere in the world, could create journal entries using their choice of text, images, and/or audio recordings. The interface was accessible in English and Spanish, but participants could submit text and audio in any language. PJP-1 ran on a weekly basis from May 2020 to May 2022.

Data Overview: This Qualitative Data Repository (QDR) project contains all journal entries and closed-ended survey responses submitted during PJP-1, along with accompanying descriptive and explanatory materials. The dataset includes individual journal entries and accompanying quantitative survey responses from more than 1,800 participants in 55 countries. Of nearly 27,000 journal entries in total, over 2,700 include images and over 300 are audio files. All data were collected via the Qualtrics survey platform. PJP-1 was approved as a research study by the Institutional Review Board (IRB) at the University of Connecticut. Participants were introduced to the project in a variety of ways, including through the PJP website as well as professional networks, PJP’s social media accounts (on Facebook, Instagram, and Twitter), and media coverage of the project. Participants provided a single piece of contact information — an email address or mobile phone number — which was used to distribute weekly invitations to participate. This contact information has been stripped from the dataset and will not be accessible to researchers. PJP uses a mixed-methods research approach and a dynamic cohort design. After enrolling in PJP-1 via the project’s website, participants received weekly invitations to contribute to their journals via their choice of email or SMS (text message). Each weekly invitation included a link to that week’s journaling prompts and accompanying survey questions. Participants could join at any point, and they could stop participating at any point as well. They also could stop participating and later restart. Retention was encouraged with a monthly raffle of three $100 gift cards.
All individuals who had contributed that month were eligible. Regardless of when they joined, all participants received the project’s narrative prompts and accompanying survey questions in the same order. In Week 1, before contributing their first journal entries, participants were presented with a baseline survey that collected demographic information, including political leanings, as well as self-reported data about COVID-19 exposure and physical and mental health status. Some of these survey questions were repeated at periodic intervals in subsequent weeks, providing quantitative measures of change over time that can be analyzed in conjunction with participants' qualitative entries. Surveys employed validated questions where possible. The core of PJP-1 involved two weekly opportunities to create journal entries in the format of their choice (text, image, and/or audio). Each week, journalers received a link with an invitation to create one entry in response to a recurring narrative prompt (“How has the COVID-19 pandemic affected your life in the past week?”) and a second journal entry in response to their choice of two more tightly focused prompts. Typically the pair of prompts included one focusing on subjective experience (e.g., the impact of the pandemic on relationships, sense of social connectedness, or mental health) and another with an external focus (e.g., key sources of scientific information, trust in government, or COVID-19’s economic impact). Each week,...
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach. In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously. This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving impressive training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was rigorously evaluated in a production environment, where it continued to demonstrate exceptional accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar to ensure that the model can effectively distinguish between different machine learning models, even when the prompts are closely related.
KNN
How would you create a KNN model to classify emails as spam or not spam based on their content and metadata?
How could you implement a KNN model to classify handwritten digits using the MNIST dataset?
How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences?
How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms, etc.?
Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width?
How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments?
Can you create a KNN model for me that could be used in malware classification?
Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic?
Can you make a KNN model that would predict the stock price of a given stock for the next week?
Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?
Decision Tree
Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me?
How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content?
What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses?
Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data?
In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used?
Can you create a decision tree classifier for me?
Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies?
Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget?
How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website?
How do I create a decision tree for time-series forecasting?
Random Forest
Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me?
In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...
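For reference, the binary bag-of-words encoding described above can be sketched with scikit-learn's CountVectorizer in binary mode. This is an illustration under assumed toy prompts and labels, not the pipeline actually used to build the dataset.

from sklearn.feature_extraction.text import CountVectorizer

# toy prompts and labels, for illustration only
prompts = [
    "Can you create a KNN model for classifying different species of flowers?",
    "How do I create a decision tree for time-series forecasting?",
]
labels = ["KNN", "Decision Tree"]

vectorizer = CountVectorizer(binary=True)  # 1 if the dictionary word appears in the prompt, 0 otherwise
X = vectorizer.fit_transform(prompts)      # one row per prompt, one binary column per dictionary word

print(vectorizer.get_feature_names_out())
print(X.toarray())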
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
Two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project (see the loading sketch after this list). We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, in each case comparing waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type
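A minimal loading sketch for the pickled dataframes listed above (the file name is taken from the listing; the column layout is whatever the dataframe carries):

import pandas as pd

# load the misfit dataframe named above and inspect its structure
df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(df_misfits.head())
print(df_misfits.columns)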
How to use this dataset:
To set up the conda environment:
make sure you have anaconda/miniconda
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes; if there are any, download an older Salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: this is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: this can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data, Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759–1780, https://doi.org/10.1029/2018JB016902
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663).
There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particularly impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, pick planning, etc. While significant progress has been made in leveraging machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.
To tackle this big, diverse data problem, we are inspired by the recent rise of the metaverse concept, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulties to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications compared to existing general-purpose performance metrics. This repository contains the first phase of the MetaGraspNet benchmark dataset, which includes detailed object detection, segmentation, and layout annotations, and a script for the layout-weighted performance metric (https://github.com/y2863/MetaGraspNet).
(Image: https://raw.githubusercontent.com/y2863/MetaGraspNet/main/.github/500.png)
If you use MetaGraspNet dataset or metric in your research, please use the following BibTeX entry.
BibTeX
@article{chen2021metagraspnet,
author = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and
Alexander Wong},
title = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
journal = {arXiv preprint arXiv:2112.14663},
year = {2021}
}
This dataset is arranged in the following file structure:
root
|-- meta-grasp
|-- scene0
|-- 0_camera_params.json
|-- 0_depth.png
|-- 0_rgb.png
|-- 0_order.csv
...
|-- scene1
...
|-- difficulty-n-coco-label.json
Each scene is a unique arrangement of objects, which we then display from various different angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an rgb image (x_rgb.png), as well as a matrix representation of the ordering of each object (x_order.csv). The full labels for the image are all available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.
The matrix describes a pairwise obstruction relationship between each object within the image. Given a "parent" object covering a "child" object:
relationship_matrix[child_id, parent_id] = -1
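A hedged sketch of reading one scene's ordering matrix and listing the obstruction pairs it encodes, assuming x_order.csv is a plain comma-separated numeric matrix with no header:

import numpy as np

# rows index the "child" (covered) object, columns the "parent" (covering) object
order = np.loadtxt("meta-grasp/scene0/0_order.csv", delimiter=",")
children, parents = np.where(order == -1)
for child_id, parent_id in zip(children, parents):
    print(f"object {parent_id} covers object {child_id}")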
The NYC Parks Events Listing database is used to store event information displayed on the Parks website, nyc.gov/parks. There are seven related tables that make up this database:
Events_Events (the primary table containing basic data about every event; each record is an event)
Events_Categories (each record is a category describing an event; one event can be in more than one category)
Events_Images (each record is an image related to an event; one event can have more than one image)
Events_Links (each record is a link with more information about an event; one event can have more than one link)
Events_Locations (each record is a location where an event takes place; one event can have more than one location)
Events_Organizers (each record contains a group or person organizing an event; one event can have more than one organizer)
Events_YouTube (each record is a link to a YouTube video about an event; one event can have more than one YouTube video)
The Events_Events table is the primary table. All other tables can be related to it by joining on the event_id. This data contains records from 2013 onward. For a complete list of related datasets, please follow This Link.
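As an illustration of the join described above, the following sketch attaches categories and locations to events with pandas. The CSV file names are assumed to mirror the table names; only the event_id join key comes from the description.

import pandas as pd

events = pd.read_csv("Events_Events.csv")
categories = pd.read_csv("Events_Categories.csv")
locations = pd.read_csv("Events_Locations.csv")

# one row per event/category/location combination; repeat for the other related tables as needed
events_full = (
    events
    .merge(categories, on="event_id", how="left")
    .merge(locations, on="event_id", how="left")
)
print(events_full.head())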
Q1: What is the best time to book a United Airlines flight for a lower fare?
☎️+1 (888) 706-5253 is the number to call if you're trying to find the best time to book United Airlines flights for lower fares. Historically, the best time to book is about 1 to 3 months before your departure for domestic flights and 2 to 8 months ahead for international routes. Prices often dip mid-week, especially on Tuesdays and Wednesdays, so calling ☎️+1 (888) 706-5253 during these windows can give you a better shot at scoring deals.
☎️+1 (888) 706-5253 is also helpful when trying to align with United’s flash sales or limited-time promotions that aren't always well-advertised. Travel experts at this number can guide you on fare trends based on your specific routes. For maximum savings, it’s recommended you call ☎️+1 (888) 706-5253 and ask about fare prediction tools or price alerts.
Flexibility is key—being open to alternative dates or nearby airports can save you hundreds. Contact ☎️+1 (888) 706-5253 for help navigating these options and optimizing your itinerary. The agents at ☎️+1 (888) 706-5253 can also assist in combining miles with fare sales for even deeper savings.
Customer Reviews:
Sophia R., ⭐⭐⭐⭐⭐: “Saved $220 just by booking after a quick call to ☎️+1 (888) 706-5253. Amazing tips!”
Jason M., ⭐⭐⭐⭐: “Helpful reps, gave me the perfect time to book. Thanks, ☎️+1 (888) 706-5253!”
Emily T., ⭐⭐⭐⭐⭐: “Would’ve missed the sale if not for ☎️+1 (888) 706-5253—highly recommended!”
Q2: What is the fastest way to book a United Airlines flight?
☎️+1 (888) 706-5253 is the fastest and most reliable number to call when you're in a rush to book a United Airlines flight. Speaking to a live travel advisor ensures real-time support, immediate confirmation, and often access to unpublished fares. Whether you're booking last minute or just want everything handled efficiently, call ☎️+1 (888) 706-5253 to complete your booking in minutes.
☎️+1 (888) 706-5253 agents can streamline your booking process by handling flight selection, baggage add-ons, seat preferences, and even travel insurance in one conversation. If your travel plans are urgent or complex, ☎️+1 (888) 706-5253 is much faster than navigating the United Airlines website or mobile app on your own.
Booking through ☎️+1 (888) 706-5253 can also help avoid mistakes like date errors or missed discounts, which can happen with rushed online bookings. Don’t waste time—call ☎️+1 (888) 706-5253 for fast, accurate booking with immediate support.
Customer Reviews:
Kevin L., ⭐⭐⭐⭐⭐: “Needed a flight in an hour—booked in 10 minutes via ☎️+1 (888) 706-5253!”
Laura B., ⭐⭐⭐⭐: “Super fast and friendly. No hold time. Thanks ☎️+1 (888) 706-5253!”
Chris D., ⭐⭐⭐⭐⭐: “Quickest way to book. Great support from ☎️+1 (888) 706-5253!”
Q3: How do I talk to someone at United Airlines about a booking?
☎️+1 (888) 706-5253 is your direct link to speak with someone about any United Airlines booking issues or questions. Whether you're modifying an existing itinerary, checking baggage rules, or confirming travel credits, the agents at ☎️+1 (888) 706-5253 can assist immediately. It’s far quicker than navigating United’s automated phone system.
You can avoid long wait times and confusion by reaching out to ☎️+1 (888) 706-5253, where travel specialists are trained to handle United Airlines reservations. Their expertise helps resolve booking complications, ticket upgrades, name corrections, and more. When urgent help is needed, ☎️+1 (888) 706-5253 is far superior to trying to resolve issues online.
Even if you booked your flight elsewhere, the team at ☎️+1 (888) 706-5253 can act as a go-between with United and advocate for changes or refunds. For real answers and human help, call ☎️+1 (888) 706-5253 right away.
Customer Reviews:
Hannah G., ⭐⭐⭐⭐⭐: “No more long holds with United. Just called ☎️+1 (888) 706-5253—they fixed it fast!”
James K., ⭐⭐⭐⭐: “Better than calling the airline directly. ☎️+1 (888) 706-5253 got me a solution in 5 mins.”
Maya P., ⭐⭐⭐⭐⭐: “Amazing customer care from ☎️+1 (888) 706-5253. Highly recommend for booking support.”
Q4: Can I call United Airlines for reservations quickly?
☎️+1 (888) 706-5253 is your best option to make a quick call and get United Airlines reservations handled immediately. Unlike the standard United call center, this number connects you directly with live agents who specialize in rapid booking. For those in a hurry or looking to avoid hold times, ☎️+1 (888) 706-5253 is the fastest route.
With ☎️+1 (888) 706-5253, you can finalize a reservation in one call—no waiting for online confirmations or bouncing between web pages. Whether you're booking for a solo trip or a group, calling ☎️+1 (888) 706-5253 ensures the reservation process is smooth and efficient from start to finish.
Quick reservations also come with expert advice when you call ☎️+1 (888) 706-5253, including seat upgrades, fare classes, and best departure times. Save time and stress by making United reservations through ☎️+1 (888) 706-5253 anytime, even on weekends.
Customer Reviews:
Rachel N., ⭐⭐⭐⭐⭐: “Took less than 6 minutes—awesome team at ☎️+1 (888) 706-5253.”
Eric W., ⭐⭐⭐⭐: “Booking was fast and hassle-free with ☎️+1 (888) 706-5253.”
Diana S., ⭐⭐⭐⭐⭐: “Perfect for last-minute plans. I trust ☎️+1 (888) 706-5253 every time.”
Q5: How can I make a reservation with United Airlines?
☎️+1 (888) 706-5253 is the ideal number to call if you want to make a reservation with United Airlines without the confusion of online portals. The booking agents at ☎️+1 (888) 706-5253 will walk you through each step, from flight selection to payment. This ensures you get the best fare and schedule options in one simple conversation.
Booking through ☎️+1 (888) 706-5253 also gives you access to fare bundles, seat upgrades, and flexible cancellation policies. If you're unsure about dates or need to coordinate with other travelers, the team at ☎️+1 (888) 706-5253 will accommodate your needs and explain your options clearly.
For families, business travelers, or vacationers, calling ☎️+1 (888) 706-5253 is a smarter, more personalized way to reserve flights. Don’t take chances with online errors—get real-time support and secure your United reservation fast by calling ☎️+1 (888) 706-5253 now.
Customer Reviews:
Liam F., ⭐⭐⭐⭐⭐: “Best reservation experience I’ve had. ☎️+1 (888) 706-5253 made it so easy.”
Ava C., ⭐⭐⭐⭐: “Got extra legroom and meal add-ons—thanks ☎️+1 (888) 706-5253!”
Noah J., ⭐⭐⭐⭐⭐: “Never booking online again. ☎️+1 (888) 706-5253 is my go-to.”
Yes, you absolutely can make a reservation by calling Lufthansa Airlines directly. ✈️📞+1(877) 471-1812 This method is convenient, reliable, and efficient, especially for travelers who prefer human interaction. ✈️📞+1(877) 471-1812 Calling Lufthansa ensures that your queries are answered in real time, and any custom requests are accurately documented. ✈️📞+1(877) 471-1812
When you dial ✈️📞+1(877) 471-1812, you are immediately connected with a Lufthansa booking expert. ✈️📞+1(877) 471-1812 These representatives are professionally trained to handle all types of reservations, from economy to first class. ✈️📞+1(877) 471-1812 Whether it’s a one-way trip, round trip, or multi-city itinerary, booking by phone is fully supported. ✈️📞+1(877) 471-1812
The phone reservation process is user-friendly and personalized. ✈️📞+1(877) 471-1812 You’ll be asked for your travel dates, destination, number of passengers, and preferred class. ✈️📞+1(877) 471-1812 The agent can then provide you with options, explain the fare types, and even suggest route alternatives. ✈️📞+1(877) 471-1812
If you have special requirements like extra baggage, wheelchair assistance, or meal preferences, booking by phone helps. ✈️📞+1(877) 471-1812 You can communicate your needs clearly to a live representative. ✈️📞+1(877) 471-1812 This reduces the risk of booking errors or missing essential accommodations for your journey. ✈️📞+1(877) 471-1812
Senior citizens, families, and first-time flyers often find phone booking more reassuring. ✈️📞+1(877) 471-1812 The agent explains every detail and ensures you understand terms, conditions, and cancellation policies. ✈️📞+1(877) 471-1812 If you’re not tech-savvy or don’t feel confident booking online, calling is the best route. ✈️📞+1(877) 471-1812
Lufthansa phone agents can also access promotions and unpublished fares. ✈️📞+1(877) 471-1812 Sometimes, better deals are available by phone that you won’t see online. ✈️📞+1(877) 471-1812 Especially during seasonal sales or if you’re booking on short notice, speaking directly may help you save. ✈️📞+1(877) 471-1812
You’ll also receive your booking confirmation via email or SMS after a successful phone reservation. ✈️📞+1(877) 471-1812 This provides instant peace of mind and allows you to review all the details. ✈️📞+1(877) 471-1812 If there’s anything incorrect, you can call back to fix it immediately. ✈️📞+1(877) 471-1812
Another advantage is the ability to hold a reservation temporarily. ✈️📞+1(877) 471-1812 In some cases, the agent may allow you to hold a fare for 24 hours. ✈️📞+1(877) 471-1812 This gives you time to finalize your plans before committing to the purchase. ✈️📞+1(877) 471-1812
Phone reservations are also ideal for corporate or group travel. ✈️📞+1(877) 471-1812 Lufthansa offers special services and discounts for large bookings, often best arranged by speaking to a representative. ✈️📞+1(877) 471-1812 These bookings may include additional benefits like seat assignments or flexible change policies. ✈️📞+1(877) 471-1812
If you’re an elite flyer or Miles & More member, calling Lufthansa allows you to apply points easily. ✈️📞+1(877) 471-1812 Agents can check your balance and redeem them for upgrades or ticket purchases. ✈️📞+1(877) 471-1812 This personalized service adds value to your loyalty membership and streamlines the process. ✈️📞+1(877) 471-1812
In conclusion, yes—you can make a Lufthansa reservation by calling. ✈️📞+1(877) 471-1812 This method offers personalized service, fast processing, and human support that many travelers find invaluable. ✈️📞+1(877) 471-1812 If you're ready to book or have questions, dial ✈️📞+1(877) 471-1812 for assistance today.
Lending Club offers peer-to-peer (P2P) loans through a technological platform for various personal finance purposes and is today one of the companies that dominate the US P2P lending market. The original dataset is publicly available on Kaggle and corresponds to all the loans issued by Lending Club between 2007 and 2018. The present version of the dataset is for constructing a granting model, that is, a model designed to make decisions on whether to grant a loan based on information available at the time of the loan application. Consequently, our dataset only has a selection of variables from the original one: the variables known at the moment the loan request is made. Furthermore, the target variable of a granting model represents the final status of the loan, which is either "default" or "fully paid". Thus, we filtered out from the original dataset all the loans in transitory states. Our dataset comprises 1,347,681 records or obligations (approximately 60% of the original) and it was also cleaned for completeness and consistency (less than 1% of our dataset was filtered out).
TARGET VARIABLE
The dataset includes a target variable based on the final resolution of the credit: the default category corresponds to the event charged off and the non-default category to the event fully paid. Other values of the loan status variable are not considered, since this variable represents the state of the loan at the end of the considered time window; thus, there are no loans in transitory states. The original dataset includes the target variable “loan status”, which contains several categories ('Fully Paid', 'Current', 'Charged Off', 'In Grace Period', 'Late (31-120 days)', 'Late (16-30 days)', 'Default'). However, in our dataset we only keep loans whose final status is fully paid or charged off, and we transform this variable into a binary variable called “Default”, with a 0 for fully paid loans and a 1 for defaulted (charged off) loans.
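A sketch of this target construction, applied to the original Kaggle release (the file name is hypothetical; the loan_status column name is assumed from the original dataset):

import pandas as pd

loans = pd.read_csv("lending_club_2007_2018.csv", low_memory=False)  # hypothetical file name

# keep only resolved loans and encode the binary target described above
resolved = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
resolved["Default"] = (resolved["loan_status"] == "Charged Off").astype(int)  # 1 = default, 0 = fully paid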
EXPLANATORY VARIABLES
The explanatory variables that we use correspond only to the information available at the time of the application. Variables such as the interest rate, grade, or subgrade are generated by the company as a result of its credit risk assessment process, so they were filtered out of the dataset: they must not be considered in risk models that predict default at the granting stage.
FULL LIST OF VARIABLES
Loan identification variables:
id: Loan id (unique identifier).
issue_d: Month and year in which the loan was approved.
Quantitative variables:
revenue: Borrower's self-declared annual income during registration.
dti_n: Indebtedness ratio for obligations excluding mortgage. Monthly information. This ratio has been calculated considering the indebtedness of the whole group of applicants. It is estimated as the ratio calculated using the co-borrowers’ total payments on the total debt obligations divided by the co-borrowers’ combined monthly income.
loan_amnt: Amount of credit requested by the borrower.
fico_n: Defined between 300 and 850, reported by Fair Isaac Corporation as a risk measure based on historical credit information reported at the time of application. This value has been calculated as the average of the variables “fico_range_low” and “fico_range_high” in the original dataset.
experience_c: Binary variable that indicates whether the borrower is new to the entity. This variable is constructed from the credit date of the previous obligation in LC and the credit date of the current obligation; if the difference between dates is positive, it is not considered as a new experience with LC.
Categorical variables:
emp_length: Categorical variable with the employment length of the borrower (includes the no information category)
purpose: Credit purpose category for the loan request.
home_ownership_n: Homeownership status provided by the borrower in the registration process. Categories defined by LC: “mortgage”, “rent”, “own”, “other”, “any”, “none”. We merged the categories “other”, “any” and “none” as “other”.
addr_state: Borrower's residence state from the USA.
zip_code: Zip code of the borrower's residence.
Textual variables
title: Title of the credit request description provided by the borrower.
desc: Description of the credit request provided by the borrower.
We cleaned the textual variables. First, we removed all those descriptions that contained the default description provided by Lending Club on its web form (“Tell your story. What is your loan for?”). Moreover, we removed the prefix “Borrower added on DD/MM/YYYY >” from the descriptions to avoid any temporal background in them. Finally, as these descriptions came from a web form, we substituted all the HTML character entities with their corresponding characters (e.g. “&amp;” was substituted by “&”, “&lt;” was substituted by “<”, etc.).
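A sketch of these cleaning steps; the regular expression for the “Borrower added on DD/MM/YYYY >” prefix is an approximation of the stated rule, not the exact code used:

import html
import re

DEFAULT_DESC = "Tell your story. What is your loan for?"
PREFIX = re.compile(r"Borrower added on \d{2}/\d{2}/\d{4} >\s*")

def clean_desc(desc):
    # drop the web form's default description, strip the temporal prefix, unescape HTML entities
    if desc is None or desc.strip() == DEFAULT_DESC:
        return None
    return html.unescape(PREFIX.sub("", desc))

print(clean_desc("Borrower added on 01/02/2015 > Consolidating cards &amp; paying off debt"))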
RELATED WORKS
This dataset has been used in the following academic articles:
Sanz-Guerrero, M., Arroyo, J. (2024). Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. arXiv preprint arXiv:2401.16458. https://doi.org/10.48550/arXiv.2401.16458
Ariza-Garzón, M.J., Arroyo, J., Caparrini, A., Segovia-Vargas, M.J. (2020). Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, 64873 - 64890. https://doi.org/10.1109/ACCESS.2020.2984412
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Art&Emotion experiment description
The Art & Emotions dataset was collected in the scope of the EU-funded research project SPICE (https://cordis.europa.eu/project/id/870811) with the goal of investigating the relationship between art and emotions and collecting written data (User Generated Content) in the domain of arts in all the languages of the SPICE project (fi, en, es, he, it). The data was collected through a set of Google Forms (one for each language) and was used in the project (along with the other datasets collected by museums in the different project use cases) in order to train and test Emotion Detection models within the project.
The experiment consists of 12 artworks, chosen from a group of artworks provided by the GAM Museum of Turin (https://www.gamtorino.it/), one of the project partners. Each artwork is presented in a different section of the form; for each of the artworks, the user is asked to answer 5 open questions:
What do you see in this picture? Write what strikes you most in this image.
What does this artwork make you think about? Write the thoughts and memories that the picture evokes.
How does this painting make you feel? Write the feelings and emotions that the picture evokes in you.
What title would you give to this artwork?
Now choose one or more emoji to associate with your feelings looking at this artwork. You can also select "other" and insert other emojis by copying them from this link: https://emojipedia.org/
For each of the artworks, the user can decide whether to skip to the next artwork, if they do not like the one in front of them, or to go back to the previous artworks and modify their answers. It is not mandatory to answer all the questions for a given artwork.
The question about emotions is left open so as not to force the person to choose emotions from a list of tags belonging to a particular model (e.g. Plutchik), leaving them free to express the different shades of emotion that they feel.
Before getting to the heart of the experiment, with the artwork sections, the user is asked to leave some personal information (anonymously), to help us get an idea of the type of users who participated in the experiment.
The questions are:
Age (open)
Gender (male, female, prefer not to say, other (open))
How would you define your relationship with art?
My job is related to the art world
I am passionate about art
I am a little interested in art
I am not interested in art
Do you like going to museums or art exhibitions?
I like to visit museums frequently
I go occasionally to museums or art exhibitions
I rarely visit museums or art exhibitions
Dataset structure:
FI.csv: form data (personal data + open questions) in Finnish (UTF-8)
EN.csv: form data (personal data + open questions) in English (UTF-8)
ES.csv: form data (personal data + open questions) in Spanish (UTF-8)
HE.csv: form data (personal data + open questions) in Hebrew (UTF-8)
IT.csv: form data (personal data + open questions) in Italian (UTF-8)
artworks.csv: the list of artworks including title, author, picture name (the pictures can be found in pictures.zip) and the mapping between the columns in the form data and the questions about that artwork
pictures.zip: the JPEG images of the artworks
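A minimal loading sketch in Python, assuming only the file names and UTF-8 encoding stated above (no column names are assumed):

import pandas as pd

answers_en = pd.read_csv("EN.csv", encoding="utf-8")      # form data in English
artworks = pd.read_csv("artworks.csv", encoding="utf-8")  # artwork metadata and column mapping

print(answers_en.shape)
print(artworks.head())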
On an annual basis (individual hospital fiscal year), individual hospitals and hospital systems report detailed facility-level data on service capacity, inpatient/outpatient utilization, patients, revenues and expenses by type and payer, and balance sheet and income statement data.
Due to the large size of the complete dataset, a selected set of data representing a wide range of commonly used data items has been created that can be easily managed and downloaded. The selected data file includes general hospital information, utilization data by payer, revenue data by payer, expense data by natural expense category, financial ratios, and labor information.
There are two groups of data contained in this dataset: 1) Selected Data - Calendar Year: To make it easier to compare hospitals by year, hospital reports with report periods ending within a given calendar year are grouped together. The Pivot Tables for a specific calendar year are also found here. 2) Selected Data - Fiscal Year: Hospital reports with report periods ending within a given fiscal year (July-June) are grouped together.
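As a sketch of how these two groupings could be derived from a report-period end date (the column name report_period_end is hypothetical):

import pandas as pd

reports = pd.DataFrame({
    "report_period_end": pd.to_datetime(["2022-06-30", "2022-12-31", "2023-03-31"]),
})

# Calendar-year grouping: the year in which the report period ends.
reports["calendar_year"] = reports["report_period_end"].dt.year
# Fiscal-year (July-June) grouping: periods ending July-December belong to the
# fiscal year that closes the following June.
reports["fiscal_year"] = reports["report_period_end"].dt.year + (
    reports["report_period_end"].dt.month >= 7
).astype(int)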
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
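As an illustration of the last three columns, here is a short Python sketch using the usual definitions of isogram order, palindrome, and tautonym (the dataset description does not spell these out, so the definitions are assumptions):

from collections import Counter

def isogramy(word):
    """Return n if every letter occurs exactly n times, else 0."""
    counts = set(Counter(word.lower()).values())
    return counts.pop() if len(counts) == 1 else 0

def is_palindrome(word):
    w = word.lower()
    return w == w[::-1]

def is_tautonym(word):
    w = word.lower()
    half = len(w) // 2
    return len(w) % 2 == 0 and w[:half] == w[half:]

print(isogramy("deed"), is_palindrome("deed"), is_tautonym("deed"))  # 2 True False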
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
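For a quick look at the finished database outside of R, here is a minimal Python sketch; it assumes nothing beyond what is documented above, so table names are discovered from sqlite_master rather than hard-coded:

import sqlite3

con = sqlite3.connect("isograms.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)

# Query one of the discovered tables; the column names follow the layout
# documented in section 1.1 above.
table = tables[0]
for word, length, count in con.execute(
        f"SELECT word, length, count FROM {table} ORDER BY count DESC LIMIT 10"):
    print(word, length, count)
con.close()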
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
https://i.imgur.com/7spoIJT.png
One could use this dataset, for example, to build a classifier that distinguishes workers who are abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.
Use the "Fork" or "Download this Dataset" button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and the presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
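Purely as an illustration of this suppression rule, and not CDC's actual procedure, a pandas sketch with hypothetical column names might look like:

import numpy as np
import pandas as pd

DEMO_COLS = ["sex", "age_group", "race_ethnicity"]  # hypothetical column names

def suppress_rare_combinations(df, threshold=5):
    out = df.copy()
    # Size of each demographic combination in the data.
    combo_size = out.groupby(DEMO_COLS, dropna=False)[DEMO_COLS[0]].transform("size")
    # Re-code (rather than remove) the identifying fields for rare combinations.
    out.loc[combo_size < threshold, DEMO_COLS] = np.nan
    return out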
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the item sets they are most likely to purchase. The dataset contains a retailer's transaction data covering all transactions that happened over a period of time. The retailer will use the results to grow its business and provide customers with item-set suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to find associations between different objects in a set, i.e., frequent patterns in a transaction database. It can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought mouse mat => bought computer mouse": support = P(mat & mouse) = 8/100 = 0.08; confidence = support / P(mouse mat) = 0.08/0.09 ≈ 0.89; lift = confidence / P(computer mouse) = 0.89/0.10 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
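A few lines of Python reproduce the arithmetic of this toy example:

n = 100
n_mat, n_mouse, n_both = 9, 10, 8

support = n_both / n                # P(mat & mouse) = 0.08
confidence = support / (n_mat / n)  # 0.08 / 0.09 ≈ 0.89
lift = confidence / (n_mouse / n)   # 0.89 / 0.10 ≈ 8.9
print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.89 8.9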
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each library is described briefly below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
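The remaining steps of the original walkthrough are carried out in R. As a rough, non-authoritative equivalent in Python, the mlxtend package can perform the same conversion to transactions and the Apriori mining; the column names BillNo and Itemname are assumptions about the spreadsheet layout:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Read the spreadsheet and drop rows with missing invoice numbers or item names.
df = pd.read_excel("Assignment-1_Data.xlsx").dropna(subset=["BillNo", "Itemname"])

# One transaction per invoice: the list of items bought together.
transactions = df.groupby("BillNo")["Itemname"].apply(list).tolist()

# One-hot encode the transactions, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())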
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These two datasets represent sensor events collected in the CASAS smart apartment testbed at Washington State University. In both datasets, ambient sensor readings are collected while 20 participants perform five ADL (activities of daily living) tasks in the apartment. This resource is valuable for designing and validating activity recognition algorithms. Further, it provides data for detecting errors, which is helpful for assessing and intervening to support functional independence.
In the adl_noerror dataset, the five tasks are:
In the adl_error dataset, a scripted error is introduced. The errors are:
The files are named according to the participant number and task number (e.g., p01.t1.csv contains sensor data for participant 1 performing task 1). There is one sensor reading in each row with fields date, time, sensor, and message.
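A minimal reading sketch in Python, assuming the four fields appear as columns and that the files have no header row (both assumptions):

import pandas as pd

events = pd.read_csv("p01.t1.csv", header=None,
                     names=["date", "time", "sensor", "message"])
print(events.head())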
A floorplan of the smart apartment is provided in Chinook.png, together with the locations of the sensors. A zoomed-in look at the Chinook cabinet with sensors is provided in Chinook_Cabinet.png. The sensors are categorized (and named) as: