100+ datasets found
  1. High income tax filers in Canada

    • www150.statcan.gc.ca
    • open.canada.ca
    • +1 more
    Updated Oct 28, 2024
    Cite
    Government of Canada, Statistics Canada (2024). High income tax filers in Canada [Dataset]. http://doi.org/10.25318/1110005501-eng
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    Area covered
    Canada
    Description

    This table presents income shares, thresholds, tax shares, and total counts of individual Canadian tax filers, with a focus on high-income individuals (95% income threshold, 99% threshold, etc.). Income thresholds are based on national threshold values, regardless of selected geography; for example, the number of Nova Scotians in the top 1% is calculated as the number of tax-filing Nova Scotians whose total income exceeded the 99% national income threshold. Different definitions of income are available in the table, namely market, total, and after-tax income, each with and without capital gains.
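The national-threshold convention described above can be sketched numerically. A minimal Python sketch with synthetic incomes (the real thresholds come from the table itself; all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical total incomes: a national sample, and a subset standing in
# for one province's tax filers
national_income = rng.lognormal(mean=11, sigma=0.6, size=100_000)
provincial_income = national_income[:5_000]

# the 99% threshold is computed nationally, regardless of selected geography
national_p99 = np.quantile(national_income, 0.99)

# the provincial "top 1%" count uses the *national* threshold,
# so it need not equal 1% of the provincial sample
top1_count = int((provincial_income > national_p99).sum())
print(top1_count)
```

The point of the convention: a province's "top 1%" share can be well above or below 1% of its own filers, because the cutoff is fixed nationally.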

  2. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

    File format: R workspace file; "Simulated_Dataset.RData".

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    "CWVS_LMC.txt": delivered as a .txt file containing R code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, this code can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.

    "Results_Summary.txt": also delivered as a .txt file containing R code. Once "CWVS_LMC.txt" has finished running on the simulated dataset, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:
    • For "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
    • For "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

    Reproducibility: The data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study. Replication procedure:
    • Load the "Simulated_Dataset.RData" workspace
    • Run the code contained in "CWVS_LMC.txt"
    • Once "CWVS_LMC.txt" is complete, run "Results_Summary.txt"

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
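The week-by-week standardization described above (subtract each week's median, divide by its IQR) can be sketched numerically. A minimal Python sketch with synthetic exposures; the actual materials are R code, and the real medians/IQRs are deliberately withheld:

```python
import numpy as np

def standardize_by_week(z):
    """Standardize an (n individuals x m weeks) exposure matrix:
    for each week, subtract that week's median and divide by its IQR."""
    med = np.median(z, axis=0)                      # per-week medians
    q75, q25 = np.percentile(z, [75, 25], axis=0)
    iqr = q75 - q25                                 # per-week interquartile ranges
    return (z - med) / iqr

# synthetic stand-in for weekly average pregnancy exposures
rng = np.random.default_rng(0)
z = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
zs = standardize_by_week(z)
# after standardization, each week's median is 0 and IQR is 1
```

Because only the standardized values are released, the transformation cannot be inverted to recover raw exposure levels, which is what protects the spatial locations.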

  3. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 24, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Available download formats: zip (150021566586 bytes)
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
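The two-level layout above is a simple arithmetic mapping from a KernelVersions id to a folder path. A minimal Python sketch (the function name is hypothetical; file extensions depend on the notebook's language, so only the directory part is shown):

```python
def kernel_version_path(kid: int) -> str:
    """Map a Meta Kaggle KernelVersions id to its two-level folder path:
    top-level folder = id // 1,000,000, subfolder = the next thousands group."""
    top = kid // 1_000_000
    sub = (kid // 1_000) % 1_000
    return f"{top}/{sub}/{kid}"

print(kernel_version_path(123_456_789))  # → 123/456/123456789
```

For example, version id 123,456,789 lands in folder 123, subfolder 456, matching the ranges described above.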

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  4. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
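The "rectangular file" idea above (household- and family-level information attached to each person record) is just a left join onto the person table. A tiny Python/pandas sketch with hypothetical miniature tables (the actual workflow is the R scripts described above):

```python
import pandas as pd

# hypothetical miniature household- and person-level records
households = pd.DataFrame({"h_id": [1, 2], "h_income": [52000, 81000]})
persons = pd.DataFrame({"p_id": [10, 11, 12],
                        "h_id": [1, 1, 2],
                        "age": [34, 6, 29]})

# "rectangularize": attach household-level columns to every person record,
# so the result has exactly one row per person
rect = persons.merge(households, on="h_id", how="left")
print(len(rect))  # → 3
```

Note how the household income is repeated on every member's row; that duplication is what "rectangular" means here.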

  5. Survey of Income and Program Participation (SIPP)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Survey of Income and Program Participation (SIPP) [Dataset]. http://doi.org/10.7910/DVN/I0FFJV
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the survey of income and program participation (sipp) with r. if the census bureau's budget was gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp). it's giant. it's rich with variables. it's monthly. it follows households over three, four, now five year panels. the congressional budget office uses it for their health insurance simulation. analysts read that sipp has person-month files, get scurred, and retreat to inferior options. the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon. questions swing wild and free through the jungle canopy i mean core data dictionary. legend has it that there are still species of topical module variables that scientists like you have yet to analyze. ponce de león would've loved it here. ponce. what a name. what a guy. the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, try their survey design tutorial. since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, every respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month. to repeat: core wave files contain four records per person, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool?

    this new github repository contains eight, you read me, eight scripts:

    1996 panel - download and create database.R
    2001 panel - download and create database.R
    2004 panel - download and create database.R
    2008 panel - download and create database.R
    • since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
    • properly handle the parentheses seen in a few of the sas importation scripts; because the SAScii package currently does not create an rsqlite database, initiate a variant of the read.SAScii function that imports ascii data directly into a sql database (.db)
    • download each microdata file - weights, topical modules, everything - then read 'em into sql

    2008 panel - full year analysis examples.R
    • define which waves and specific variables to pull into ram, based on the year chosen
    • loop through each of twelve months, constructing a single-year temporary table inside the database
    • read that twelve-month file into working memory, then save it for faster loading later if you like
    • read the main and replicate weights columns into working memory too, merge everything
    • construct a few annualized and demographic columns using all twelve months' worth of information
    • construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
    • reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19)

    2008 panel - point-in-time analysis examples.R
    • define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
    • read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory
    • read the topical module and replicate weights files into working memory too, merge it like you mean it
    • construct a few new, exciting variables using both core and topical module questions
    • construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
    • reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is.

    2008 panel - median value of household assets.R
    • define which wave(s) and specific variables to pull into ram, based on the topical module chosen
    • read the topical module and replicate weights files into working memory too, merge once again
    • construct a replicate-weighted complex sample design with a...
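The core-wave structure described above (four records per person per wave, one for each reference month srefmon 1-4) stacks into a person-month file. A tiny Python/pandas sketch with hypothetical miniature waves for a single respondent (the real workflow is the R scripts above):

```python
import pandas as pd

# hypothetical miniature core-wave extracts: four records per person per wave,
# one for each reference month (srefmon 1-4)
wave1 = pd.DataFrame({"p_id": 1, "srefmon": [1, 2, 3, 4],
                      "income": [900, 910, 905, 920]})
wave2 = pd.DataFrame({"p_id": 1, "srefmon": [1, 2, 3, 4],
                      "income": [930, 925, 940, 950]})

# stacking every core wave yields one record per person per month of the panel;
# panel_month numbers the months consecutively across waves
person_month = pd.concat(
    [w.assign(panel_month=i * 4 + w["srefmon"]) for i, w in enumerate([wave1, wave2])],
    ignore_index=True,
)
print(len(person_month))  # → 8 person-month rows across two waves
```

Scale that up to ~100,000 respondents over ~4 years and you get the "mmmassive" stacked file the description warns about, which is why an analysis plan comes before the code.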

  6. Data for: The Pandemic Journaling Project, Phase One (PJP-1)

    • data.qdr.syr.edu
    3gp +22
    Updated Feb 15, 2024
    Cite
    Sarah S. Willen; Katherine A. Mason (2024). Data for: The Pandemic Journaling Project, Phase One (PJP-1) [Dataset]. http://doi.org/10.5064/F6PXS9ZK
    Available download formats: 3gp, audio/vnd.dlna.adts, bin, docx, gif, html, jpeg, mp4, mp4a, mpga, ogv, pdf, png, qt, wav, webm, webp, wmv, xlsx (several hundred individual files; the per-file size listing is truncated in the source)
mp4a(912224), 3gp(5920182), jpeg(1714504), jpeg(2280388), mpga(4640203), jpeg(3332571), mp4a(1269110), jpeg(1788844), mp4a(4350631), mp4a(1496135), bin(1772535), mpga(371534), jpeg(4221720), mp4a(1486515), mp4a(3758180), jpeg(3413660), jpeg(3451347), mp4(6993330), bin(152038), jpeg(3535829), jpeg(3234324), tiff(-1), jpeg(2251269), jpeg(2600986), bin(1606725), bin(1615540), jpeg(629961), mp4a(1364069), jpeg(849628), jpeg(2384630), jpeg(854035), jpeg(1059910), mp4a(432261), jpeg(6803436), qt(2010499), mp4a(1222788), png(252350), mp4a(561403), mp4a(1301355), jpeg(78430), jpeg(153294), jpeg(3111015), jpeg(3506560), mp4a(1614765), mp4a(4359255), mp4a(1609908), jpeg(3129756), jpeg(1440858), jpeg(24096), mpga(6606764), mp4a(219517), wav(16120364), mp4a(1071439), jpeg(3293381), jpeg(112899), jpeg(2875869), jpeg(4948125), mp4a(1615299), png(3496115), mp4a(1986411), png(586680), jpeg(1897709), jpeg(2273020), jpeg(4022260), jpeg(377213), mp4a(1702687), html(4191543), jpeg(1398077), jpeg(2079488), jpeg(31946), jpeg(1243971), jpeg(2389859), qt(574596), mp4a(532776), jpeg(2730221), mp4a(510562), jpeg(2968414), mp4a(2145487), jpeg(496123), jpeg(4274950), png(548620), jpeg(2124741), png(5709270), jpeg(5322032), mp4a(304846), jpeg(2969836), jpeg(5084546), jpeg(173417), mpga(2814171), pdf(308146), png(7879), png(2155793), jpeg(1568444), jpeg(107669), jpeg(3844552), jpeg(5050854), mp4(59931145), jpeg(26777), bin(3681626), mp4a(1124596), txt(186920), jpeg(520311), bin(416102), mp4a(7284061), jpeg(40281), jpeg(657555), png(1437413), jpeg(2534845), jpeg(445866), jpeg(1237900), jpeg(4250838), bin(156966), tsv(733), qt(3177780), bin(864966), jpeg(11690), mp4a(3045602), mp4a(2449349), bin(748148), jpeg(1825738), jpeg(1990482), mpga(1190436), mp4a(5845364), mp4a(1448064), jpeg(3171202), bin(2501650), jpeg(2273265), mp4a(619603), jpeg(951877), jpeg(63914), mp4a(1271334), jpeg(1976245), mpga(4817983), jpeg(331201), jpeg(129869), jpeg(7445743), jpeg(5717518), jpeg(2968114), mp4a(693312), 
mp4a(264471), jpeg(5399866), jpeg(71431), jpeg(1519243), jpeg(1593696), mp4(4106014), mp4a(705329), mp4a(1148157), jpeg(6046515), mp4a(916096), jpeg(333207), jpeg(3138702), jpeg(417572), mpga(5269701), jpeg(145637), mp4a(802505), png(1017305), jpeg(17907), jpeg(3598845), jpeg(1155643), jpeg(2638302), mp4a(822545), bin(1493618), bin(906790), jpeg(154930), jpeg(953837), zip(11659935), mp4a(1214837), mp4a(1016151), mp4a(3515351), mp4a(3839771), mp4a(1256085), jpeg(4031381), mpga(3309399), jpeg(290224), png(459262), jpeg(48326), jpeg(4736590), jpeg(1964763), jpeg(2042850), jpeg(14911972), jpeg(981139), mp4(8726495), jpeg(455010), mp4a(2202351), jpeg(72668), mpga(970535), jpeg(12825578), mp4a(1931894), jpeg(1726579), jpeg(3996799), jpeg(2413680), jpeg(2299059), png(1038072), mp4a(1467032), jpeg(732955), jpeg(145129), jpeg(4057705), jpeg(1575841), mpga(4266613), jpeg(3444896), mp4a(1095447), jpeg(2423812), 3gp(11381321), png(477408), mp4a(1358807), pdf(155079), jpeg(822164), mp4a(3978276), png(316363), jpeg(3336796), bin(1495558), jpeg(874390), jpeg(278529), jpeg(942247), pdf(129862), jpeg(4954268), jpeg(2572775), jpeg(3062482), qt(89399945), jpeg(2128499), jpeg(2849921), png(1019045), mp4a(3170368), mpga(4747435), jpeg(1371393), jpeg(3550211), mp4a(942819), jpeg(2313418), jpeg(4887470), jpeg(91125), mp4a(2439271), jpeg(2764753), mp4a(3002959), bin(729766), jpeg(798303), bin(2204684)Available download formats
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Qualitative Data Repository
    Authors
    Sarah S. Willen; Sarah S. Willen; Katherine A. Mason; Katherine A. Mason
    License

    https://qdr.syr.edu/policies/qdr-restricted-access-conditionshttps://qdr.syr.edu/policies/qdr-restricted-access-conditions

    Time period covered
    May 29, 2020 - May 31, 2022
    Area covered
    Europe, Canada, Mexico, Central America, United States
    Description

    Project Summary This dataset contains all qualitative and quantitative data collected in the first phase of the Pandemic Journaling Project (PJP). PJP is a combined journaling platform and interdisciplinary, mixed-methods research study developed by two anthropologists, with support from a team of colleagues and students across the social sciences, humanities, and health fields. PJP launched in Spring 2020 as the COVID-19 pandemic was emerging in the United States. PJP was created in order to “pre-design an archive” of COVID-19 narratives and experiences open to anyone around the world. The project is rooted in a commitment to democratizing knowledge production, in the spirit of “archival activism” and using methods of “grassroots collaborative ethnography” (Willen et al. 2022; Wurtz et al. 2022; Zhang et al 2020; see also Carney 2021). The motto on the PJP website encapsulates these commitments: “Usually, history is written only by the powerful. When the history of COVID-19 is written, let’s make sure that doesn’t happen.” (A version of this Project Summary with links to the PJP website and other relevant sites is included in the public documentation of the project at QDR.) In PJP’s first phase (PJP-1), the project provided a digital space where participants could create weekly journals of their COVID-19 experiences using a smartphone or computer. The platform was designed to be accessible to as wide a range of potential participants as possible. Anyone aged 15 or older, living anywhere in the world, could create journal entries using their choice of text, images, and/or audio recordings. The interface was accessible in English and Spanish, but participants could submit text and audio in any language. PJP-1 ran on a weekly basis from May 2020 to May 2022. 
    Data Overview This Qualitative Data Repository (QDR) project contains all journal entries and closed-ended survey responses submitted during PJP-1, along with accompanying descriptive and explanatory materials. The dataset includes individual journal entries and accompanying quantitative survey responses from more than 1,800 participants in 55 countries. Of nearly 27,000 journal entries in total, over 2,700 include images and over 300 are audio files. All data were collected via the Qualtrics survey platform. PJP-1 was approved as a research study by the Institutional Review Board (IRB) at the University of Connecticut. Participants were introduced to the project in a variety of ways, including through the PJP website as well as professional networks, PJP’s social media accounts (on Facebook, Instagram, and Twitter), and media coverage of the project. Participants provided a single piece of contact information — an email address or mobile phone number — which was used to distribute weekly invitations to participate. This contact information has been stripped from the dataset and will not be accessible to researchers. PJP uses a mixed-methods research approach and a dynamic cohort design. After enrolling in PJP-1 via the project’s website, participants received weekly invitations to contribute to their journals via their choice of email or SMS (text message). Each weekly invitation included a link to that week’s journaling prompts and accompanying survey questions. Participants could join at any point, and they could stop participating at any point as well. They could also stop participating and later restart. Retention was encouraged with a monthly raffle of three $100 gift cards, for which all individuals who had contributed that month were eligible. Regardless of when they joined, all participants received the project’s narrative prompts and accompanying survey questions in the same order. 
In Week 1, before contributing their first journal entries, participants were presented with a baseline survey that collected demographic information, including political leanings, as well as self-reported data about COVID-19 exposure and physical and mental health status. Some of these survey questions were repeated at periodic intervals in subsequent weeks, providing quantitative measures of change over time that can be analyzed in conjunction with participants' qualitative entries. Surveys employed validated questions where possible. The core of PJP-1 involved two weekly opportunities to create journal entries in the format of their choice (text, image, and/or audio). Each week, journalers received a link with an invitation to create one entry in response to a recurring narrative prompt (“How has the COVID-19 pandemic affected your life in the past week?”) and a second journal entry in response to their choice of two more tightly focused prompts. Typically the pair of prompts included one focusing on subjective experience (e.g., the impact of the pandemic on relationships, sense of social connectedness, or mental health) and another with an external focus (e.g., key sources of scientific information, trust in government, or COVID-19’s economic impact). Each week,...

  7. LLM prompts in the context of machine learning

    • kaggle.com
    Updated Jul 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle
    Authors
    Jordan Nelson
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus.

    This corpus was then transformed into the current dataset using a bag-of-words approach. In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously.

    This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was evaluated in a production environment, where it continued to demonstrate high accuracy. For this evaluation, we employed the series of questions listed below. These questions were intentionally designed to be similar, to test whether the model can distinguish between different machine learning models even when the prompts are closely related.
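
    The binary bag-of-words encoding described above can be sketched in a few lines of pure Python. This is a minimal illustration assuming simple whitespace tokenisation; the prompts, vocabulary, and labels below are invented examples, not rows from the dataset.

```python
# Minimal sketch of a binary bag-of-words encoding.
# Prompts and labels are invented examples for illustration only.

def build_vocabulary(prompts):
    """Collect the sorted set of unique words across all prompts."""
    return sorted({word for prompt in prompts for word in prompt.lower().split()})

def encode(prompt, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the prompt, else 0."""
    words = set(prompt.lower().split())
    return [1 if w in words else 0 for w in vocab]

prompts = [
    "create a knn model for spam classification",
    "create a decision tree model for fraud detection",
]
labels = ["KNN", "DecisionTree"]  # one class label per entry (multi-class, not multi-label)

vocab = build_vocabulary(prompts)
X = [encode(p, vocab) for p in prompts]
```

    Each row of X is one dataset entry: a fixed-length 0/1 vector over the dictionary, paired with exactly one class label.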

    KNN How would you create a KNN model to classify emails as spam or not spam based on their content and metadata? How could you implement a KNN model to classify handwritten digits using the MNIST dataset? How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences? How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms etc? Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width? How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments? Can you create a KNN model for me that could be used in malware classification? Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic? Can you make a KNN model that would predict the stock price of a given stock for the next week? Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

    Decision Tree Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me? How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content? What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses? Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data? In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used? Can you create a decision tree classifier for me? Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies? Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget? How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website? How do I create a decision tree for time-series forecasting?

    Random Forest Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me? In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...

  8. Dataset for paper "Mitigating the effect of errors in source parameters on...

    • data.niaid.nih.gov
    Updated Sep 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nicholas Rawlinson (2022). Dataset for paper "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6969601
    Explore at:
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Nienke Blom
    Phil-Simon Hardalupas
    Nicholas Rawlinson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2019).

    This dataset contains:

    The entire Salvus project. This project was prepared using Salvus versions 0.11.x and 0.12.2 and should be fully compatible with the latter.

    A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.

    A number of Python scripts that are used in above notebooks.

    Two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.

    An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.

    Datasets corresponding to the different figures.

    One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020

    One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).

    A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:

    travel-time arrival predictions from every source to all stations (df_stations...pkl)

    misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)

    addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (Adaptable Seismic Data Format) type
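
    The pickled Pandas dataframes listed above can be loaded with pd.read_pickle. The snippet below round-trips a tiny stand-in frame so that it is self-contained; with the real dataset you would call pd.read_pickle directly on a file such as df_misfits_cc.28s.pkl, and the column names used here are invented, not the actual schema.

```python
# Sketch of loading the pickled Pandas dataframes shipped with this dataset.
# A stand-in frame is written and read back; columns are illustrative only.
import os
import tempfile

import pandas as pd

# Stand-in for a misfit table (invented columns, not the real schema).
df = pd.DataFrame(
    {"station": ["AAK", "ABKT"], "component": ["Z", "Z"], "misfit_cc": [0.12, 0.34]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_misfits_demo.pkl")
    df.to_pickle(path)          # how the dataset's .pkl files were produced
    loaded = pd.read_pickle(path)  # how to load them for analysis

print(loaded.head())
```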

    How to use this dataset:

    To set up the conda environment:

    make sure you have anaconda/miniconda

    make sure you have access to Salvus functionality. This is not strictly necessary, but most of the functionality within this dataset relies on Salvus. You can do the analyses and create the figures without it, but you'll have to hack around in the scripts to build workarounds.

    Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.

    Additionally in your conda env, install basemap and cartopy:

    conda env create -n salvus_0_12 -f environment.yml
    conda install -c conda-forge basemap
    conda install -c conda-forge cartopy

    Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.

    To recreate the figures: This is straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.

    Figure 1: separate notebook, Fig1_event_98.py

    Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py

    Figures 3-7: Figures_perturbation_study.py

    Figures 8-10: Figures_toy_inversions.py

    To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.

    To recreate the complete Salvus project: This can be done using:

    the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)

    the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py

    For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.

    References:

    Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469

    Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020

    Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902

  9. MetaGraspNet Difficulty 1

    • kaggle.com
    zip
    Updated Mar 19, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yuhao Chen (2022). MetaGraspNet Difficulty 1 [Dataset]. https://www.kaggle.com/datasets/metagrasp/metagraspnetdifficulty1-easy
    Explore at:
    zip(4103890817 bytes)Available download formats
    Dataset updated
    Mar 19, 2022
    Authors
    Yuhao Chen
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    MetaGraspNet dataset

    This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663 ).

    There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particular impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, pick planning, etc. While significant progress has been made in leveraging of machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.

    To tackle this big, diverse data problem, we are inspired by the recent rise in the concept of metaverse, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulties to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications compared to existing general-purpose performance metrics. This repository contains the first phase of MetaGraspNet benchmark dataset which includes detailed object detection, segmentation, layout annotations, and a script for layout-weighted performance metric (https://github.com/y2863/MetaGraspNet ).


    Citing MetaGraspNet

    If you use the MetaGraspNet dataset or metric in your research, please use the following BibTeX entry:

    @article{chen2021metagraspnet,
     author = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and Alexander Wong},
     title = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
     journal = {arXiv preprint arXiv:2112.14663},
     year = {2021}
    }

    File Structure

    This dataset is arranged in the following file structure:

    root
    |-- meta-grasp
      |-- scene0
        |-- 0_camera_params.json
        |-- 0_depth.png
        |-- 0_rgb.png
        |-- 0_order.csv
        ...
      |-- scene1
      ...
    |-- difficulty-n-coco-label.json
    

    Each scene is a unique arrangement of objects, which we then display at various different angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an RGB image (x_rgb.png), as well as a matrix representation of the ordering of each object (x_order.csv). The full labels for the images are available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.

    Understanding order.csv

    The matrix describes a pairwise obstruction relationship between the objects within the image. Given a "parent" object covering a "child" object: relationship_matrix[child_id, parent_id] = -1
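
    A minimal sketch of querying this convention, assuming the matrix has already been loaded from an x_order.csv file (the exact CSV layout is not specified here); the 3x3 example matrix is invented for illustration.

```python
# Sketch of the pairwise obstruction convention:
# relationship_matrix[child_id][parent_id] == -1 means the parent covers the child.
# The 3x3 matrix below is an invented example; in practice it would be
# loaded from an x_order.csv file.

def is_covered_by(matrix, child_id, parent_id):
    """True if parent_id obstructs (covers) child_id under the -1 convention."""
    return matrix[child_id][parent_id] == -1

order = [
    [0, 0, -1],  # object 0 is covered by object 2
    [0, 0, 0],   # object 1 is unobstructed
    [0, 0, 0],   # object 2 sits on top
]

print(is_covered_by(order, child_id=0, parent_id=2))
```

    A grasp planner could use this to pick only objects with no -1 entries in their row, i.e. objects not covered by anything else.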

  10. NYC Parks Events Listing – Event Organizers

    • catalog.data.gov
    • s.cnmilf.com
    Updated Oct 13, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.cityofnewyork.us (2023). NYC Parks Events Listing – Event Organizers [Dataset]. https://catalog.data.gov/dataset/nyc-parks-events-listing-event-organizers
    Explore at:
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    data.cityofnewyork.us
    Area covered
    New York
    Description

    The NYC Parks Events Listing database is used to store event information displayed on the Parks website, nyc.gov/parks. Seven related tables make up this database:

    Events_Events (the primary table; each record is an event, with basic data about it)

    Events_Categories (each record is a category describing an event; one event can be in more than one category)

    Events_Images (each record is an image related to an event; one event can have more than one image)

    Events_Links (each record is a link with more information about an event; one event can have more than one link)

    Events_Locations (each record is a location where an event takes place; one event can have more than one location)

    Events_Organizers (each record contains a group or person organizing an event; one event can have more than one organizer)

    Events_YouTube (each record is a link to a YouTube video about an event; one event can have more than one YouTube video)

    The Events_Events table is the primary table. All other tables can be related to it by joining on the event_id. This data contains records from 2013 on. For a complete list of related datasets, please follow this link.
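
    Since all tables share event_id as the join key, the relational layout can be sketched with Python's built-in sqlite3 module. The rows and the columns other than event_id below are invented stand-ins, not the actual schema.

```python
# Sketch of joining the Events tables on event_id, using an in-memory
# SQLite database. Columns other than event_id are simplified stand-ins.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Events_Events (event_id INTEGER, title TEXT)")
cur.execute("CREATE TABLE Events_Locations (event_id INTEGER, location TEXT)")
cur.executemany("INSERT INTO Events_Events VALUES (?, ?)",
                [(1, "Summer Concert"), (2, "Nature Walk")])
cur.executemany("INSERT INTO Events_Locations VALUES (?, ?)",
                [(1, "Central Park"), (1, "Prospect Park"), (2, "Van Cortlandt Park")])

# One event can have more than one location, so the join yields one row
# per (event, location) pair.
rows = cur.execute(
    """SELECT e.title, l.location
       FROM Events_Events e
       JOIN Events_Locations l ON e.event_id = l.event_id
       ORDER BY e.event_id, l.location"""
).fetchall()
print(rows)
```

    The same pattern applies to the other five child tables: join each onto Events_Events via event_id.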

  11. How can I make a reservation with United Airlines? Dataset

    • paperswithcode.com
    Updated Jun 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). How can I make a reservation with United Airlines? Dataset [Dataset]. https://paperswithcode.com/dataset/how-can-i-make-a-reservation-with-united
    Explore at:
    Dataset updated
    Jun 23, 2025
    Description

    Q1: What is the best time to book a United Airlines flight for a lower fare?

    ☎️+1 (888) 706-5253 is the number to call if you're trying to find the best time to book United Airlines flights for lower fares. Historically, the best time to book is about 1 to 3 months before your departure for domestic flights and 2 to 8 months ahead for international routes. Prices often dip mid-week, especially on Tuesdays and Wednesdays, so calling ☎️+1 (888) 706-5253 during these windows can give you a better shot at scoring deals.

    ☎️+1 (888) 706-5253 is also helpful when trying to align with United’s flash sales or limited-time promotions that aren't always well-advertised. Travel experts at this number can guide you on fare trends based on your specific routes. For maximum savings, it’s recommended you call ☎️+1 (888) 706-5253 and ask about fare prediction tools or price alerts.

    Flexibility is key—being open to alternative dates or nearby airports can save you hundreds. Contact ☎️+1 (888) 706-5253 for help navigating these options and optimizing your itinerary. The agents at ☎️+1 (888) 706-5253 can also assist in combining miles with fare sales for even deeper savings.

    Customer Reviews:

    Sophia R., ⭐⭐⭐⭐⭐: “Saved $220 just by booking after a quick call to ☎️+1 (888) 706-5253. Amazing tips!”

    Jason M., ⭐⭐⭐⭐: “Helpful reps, gave me the perfect time to book. Thanks, ☎️+1 (888) 706-5253!”

    Emily T., ⭐⭐⭐⭐⭐: “Would’ve missed the sale if not for ☎️+1 (888) 706-5253—highly recommended!”

    Q2: What is the fastest way to book a United Airlines flight?

    ☎️+1 (888) 706-5253 is the fastest and most reliable number to call when you're in a rush to book a United Airlines flight. Speaking to a live travel advisor ensures real-time support, immediate confirmation, and often access to unpublished fares. Whether you're booking last minute or just want everything handled efficiently, call ☎️+1 (888) 706-5253 to complete your booking in minutes.

    ☎️+1 (888) 706-5253 agents can streamline your booking process by handling flight selection, baggage add-ons, seat preferences, and even travel insurance in one conversation. If your travel plans are urgent or complex, ☎️+1 (888) 706-5253 is much faster than navigating the United Airlines website or mobile app on your own.

    Booking through ☎️+1 (888) 706-5253 can also help avoid mistakes like date errors or missed discounts, which can happen with rushed online bookings. Don’t waste time—call ☎️+1 (888) 706-5253 for fast, accurate booking with immediate support.

    Customer Reviews:

    Kevin L., ⭐⭐⭐⭐⭐: “Needed a flight in an hour—booked in 10 minutes via ☎️+1 (888) 706-5253!”

    Laura B., ⭐⭐⭐⭐: “Super fast and friendly. No hold time. Thanks ☎️+1 (888) 706-5253!”

    Chris D., ⭐⭐⭐⭐⭐: “Quickest way to book. Great support from ☎️+1 (888) 706-5253!”

    Q3: How do I talk to someone at United Airlines about a booking?

    ☎️+1 (888) 706-5253 is your direct link to speak with someone about any United Airlines booking issues or questions. Whether you're modifying an existing itinerary, checking baggage rules, or confirming travel credits, the agents at ☎️+1 (888) 706-5253 can assist immediately. It’s far quicker than navigating United’s automated phone system.

    You can avoid long wait times and confusion by reaching out to ☎️+1 (888) 706-5253, where travel specialists are trained to handle United Airlines reservations. Their expertise helps resolve booking complications, ticket upgrades, name corrections, and more. When urgent help is needed, ☎️+1 (888) 706-5253 is far superior to trying to resolve issues online.

    Even if you booked your flight elsewhere, the team at ☎️+1 (888) 706-5253 can act as a go-between with United and advocate for changes or refunds. For real answers and human help, call ☎️+1 (888) 706-5253 right away.

    Customer Reviews:

    Hannah G., ⭐⭐⭐⭐⭐: “No more long holds with United. Just called ☎️+1 (888) 706-5253—they fixed it fast!”

    James K., ⭐⭐⭐⭐: “Better than calling the airline directly. ☎️+1 (888) 706-5253 got me a solution in 5 mins.”

    Maya P., ⭐⭐⭐⭐⭐: “Amazing customer care from ☎️+1 (888) 706-5253. Highly recommend for booking support.”

    Q4: Can I call United Airlines for reservations quickly?

    ☎️+1 (888) 706-5253 is your best option to make a quick call and get United Airlines reservations handled immediately. Unlike the standard United call center, this number connects you directly with live agents who specialize in rapid booking. For those in a hurry or looking to avoid hold times, ☎️+1 (888) 706-5253 is the fastest route.

    With ☎️+1 (888) 706-5253, you can finalize a reservation in one call—no waiting for online confirmations or bouncing between web pages. Whether you're booking for a solo trip or a group, calling ☎️+1 (888) 706-5253 ensures the reservation process is smooth and efficient from start to finish.

    Quick reservations also come with expert advice when you call ☎️+1 (888) 706-5253, including seat upgrades, fare classes, and best departure times. Save time and stress by making United reservations through ☎️+1 (888) 706-5253 anytime, even on weekends.

    Customer Reviews:

    Rachel N., ⭐⭐⭐⭐⭐: “Took less than 6 minutes—awesome team at ☎️+1 (888) 706-5253.”

    Eric W., ⭐⭐⭐⭐: “Booking was fast and hassle-free with ☎️+1 (888) 706-5253.”

    Diana S., ⭐⭐⭐⭐⭐: “Perfect for last-minute plans. I trust ☎️+1 (888) 706-5253 every time.”

    Q5: How can I make a reservation with United Airlines?

    ☎️+1 (888) 706-5253 is the ideal number to call if you want to make a reservation with United Airlines without the confusion of online portals. The booking agents at ☎️+1 (888) 706-5253 will walk you through each step, from flight selection to payment. This ensures you get the best fare and schedule options in one simple conversation.

    Booking through ☎️+1 (888) 706-5253 also gives you access to fare bundles, seat upgrades, and flexible cancellation policies. If you're unsure about dates or need to coordinate with other travelers, the team at ☎️+1 (888) 706-5253 will accommodate your needs and explain your options clearly.

    For families, business travelers, or vacationers, calling ☎️+1 (888) 706-5253 is a smarter, more personalized way to reserve flights. Don’t take chances with online errors—get real-time support and secure your United reservation fast by calling ☎️+1 (888) 706-5253 now.

    Customer Reviews:

    Liam F., ⭐⭐⭐⭐⭐: “Best reservation experience I’ve had. ☎️+1 (888) 706-5253 made it so easy.”

    Ava C., ⭐⭐⭐⭐: “Got extra legroom and meal add-ons—thanks ☎️+1 (888) 706-5253!”

    Noah J., ⭐⭐⭐⭐⭐: “Never booking online again. ☎️+1 (888) 706-5253 is my go-to.”


  13. u

    Data from: Lending Club loan dataset for granting models

    • produccioncientifica.ucm.es
    • portalcientifico.uah.es
    Updated 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ariza-Garzón, Miller Janny; Sanz-Guerrero, Mario; Arroyo Gallardo, Javier; Lending Club (2024). Lending Club loan dataset for granting models [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2366?lang=ca
    Explore at:
    Dataset updated
    2024
    Authors
    Ariza-Garzón, Miller Janny; Sanz-Guerrero, Mario; Arroyo Gallardo, Javier; Lending Club
    Description

    Lending Club offers peer-to-peer (P2P) loans through a technological platform for various personal finance purposes and is today one of the companies that dominate the US P2P lending market. The original dataset is publicly available on Kaggle and corresponds to all the loans issued by Lending Club between 2007 and 2018. The present version of the dataset is intended for constructing a granting model, that is, a model designed to make decisions on whether to grant a loan based on information available at the time of the loan application. Consequently, our dataset contains only a selection of variables from the original one: those known at the moment the loan request is made. Furthermore, the target variable of a granting model represents the final status of the loan, which is either "default" or "fully paid"; we therefore filtered out from the original dataset all the loans in transitory states. Our dataset comprises 1,347,681 records or obligations (approximately 60% of the original) and was also cleaned for completeness and consistency (less than 1% of our dataset was filtered out).

    TARGET VARIABLE

    The dataset includes a target variable based on the final resolution of the credit: the default category corresponds to the event charged off and the non-default category to the event fully paid. It does not consider other values in the loan status variable since this variable represents the state of the loan at the end of the considered time window. Thus, there are no loans in transitory states. The original dataset includes the target variable “loan status”, which contains several categories ('Fully Paid', 'Current', 'Charged Off', 'In Grace Period', 'Late (31-120 days)', 'Late (16-30 days)', 'Default'). However, in our dataset, we just consider loans that are either “Fully Paid” or “Default” and transform this variable into a binary variable called “Default”, with a 0 for fully paid loans and a 1 for defaulted loans.
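    The mapping from the original "loan status" categories to the binary "Default" variable can be sketched in plain Python (the toy records below are illustrative; the real dataset would be read from the CSV on Kaggle):

    ```python
    # Toy loan records mimicking the original "loan status" categories.
    loans = [
        {"id": 1, "loan_status": "Fully Paid"},
        {"id": 2, "loan_status": "Charged Off"},
        {"id": 3, "loan_status": "Current"},             # transitory: dropped
        {"id": 4, "loan_status": "Late (31-120 days)"},  # transitory: dropped
        {"id": 5, "loan_status": "Fully Paid"},
    ]

    # Keep only resolved loans and encode the binary target:
    # 0 = fully paid (non-default), 1 = charged off (default).
    resolved = [l for l in loans if l["loan_status"] in ("Fully Paid", "Charged Off")]
    for l in resolved:
        l["Default"] = 1 if l["loan_status"] == "Charged Off" else 0

    print([(l["id"], l["Default"]) for l in resolved])
    ```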

    EXPLANATORY VARIABLES

    The explanatory variables that we use correspond only to the information available at the time of the application. Variables such as the interest rate, grade, or subgrade are generated by the company as a result of a credit risk assessment process, so they were filtered out from the dataset, as they must not be considered in risk models that predict default at the granting stage.

    FULL LIST OF VARIABLES

    Loan identification variables:

    id: Loan id (unique identifier).

    issue_d: Month and year in which the loan was approved.

    Quantitative variables:

    revenue: Borrower's self-declared annual income during registration.

    dti_n: Indebtedness ratio for obligations excluding mortgage. Monthly information. This ratio has been calculated considering the indebtedness of the whole group of applicants. It is estimated as the ratio calculated using the co-borrowers’ total payments on the total debt obligations divided by the co-borrowers’ combined monthly income.

    loan_amnt: Amount of credit requested by the borrower.

    fico_n: Defined between 300 and 850, reported by Fair Isaac Corporation as a risk measure based on historical credit information reported at the time of application. This value has been calculated as the average of the variables “fico_range_low” and “fico_range_high” in the original dataset.

    experience_c: Binary variable that indicates whether the borrower is new to the entity. This variable is constructed from the credit date of the previous obligation in LC and the credit date of the current obligation; if the difference between dates is positive, it is not considered as a new experience with LC.

    Categorical variables:

    emp_length: Categorical variable with the employment length of the borrower (includes the no information category)

    purpose: Credit purpose category for the loan request.

    home_ownership_n: Homeownership status provided by the borrower in the registration process. Categories defined by LC: “mortgage”, “rent”, “own”, “other”, “any”, “none”. We merged the categories “other”, “any” and “none” as “other”.

    addr_state: Borrower's residence state from the USA.

    zip_code: Zip code of the borrower's residence.

    Textual variables

    title: Title of the credit request description provided by the borrower.

    desc: Description of the credit request provided by the borrower.

    We cleaned the textual variables. First, we removed all those descriptions that contained the default description provided by Lending Club on its web form ("Tell your story. What is your loan for?"). Moreover, we removed the prefix "Borrower added on DD/MM/YYYY >" from the descriptions to avoid any temporal background on them. Finally, as these descriptions came from a web form, we substituted all the HTML entities with their corresponding characters (e.g. "&amp;" was substituted by "&", "&lt;" was substituted by "<", etc.).
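    The cleaning steps above can be sketched with the standard library; the exact regex for the "Borrower added on" prefix is an assumption based on the stated pattern:

    ```python
    import html
    import re

    DEFAULT_DESC = "Tell your story. What is your loan for?"
    PREFIX = re.compile(r"Borrower added on \d{2}/\d{2}/\d{4} >\s*")

    def clean_description(desc):
        """Apply the three cleaning steps described above."""
        # 1. Drop descriptions left at Lending Club's web-form default.
        if desc.strip() == DEFAULT_DESC:
            return None
        # 2. Remove the temporal "Borrower added on DD/MM/YYYY >" prefix.
        desc = PREFIX.sub("", desc)
        # 3. Substitute HTML entities with their characters (&amp; -> &).
        return html.unescape(desc.strip())

    print(clean_description("Borrower added on 01/02/2015 > Debt &amp; car repair"))
    ```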

    RELATED WORKS

    This dataset has been used in the following academic articles:

    Sanz-Guerrero, M. Arroyo, J. (2024). Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. arXiv preprint arXiv:2401.16458. https://doi.org/10.48550/arXiv.2401.16458

    Ariza-Garzón, M.J., Arroyo, J., Caparrini, A., Segovia-Vargas, M.J. (2020). Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, 64873 - 64890. https://doi.org/10.1109/ACCESS.2020.2984412

  14. Z

    Art&Emotions Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alessio Bosca (2023). Art&Emotions Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8296749
    Explore at:
    Dataset updated
    Aug 29, 2023
    Dataset authored and provided by
    Alessio Bosca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Art&Emotion experiment description

    The Art & Emotions dataset was collected in the scope of the EU-funded research project SPICE (https://cordis.europa.eu/project/id/870811) with the goal of investigating the relationship between art and emotions and collecting written data (User Generated Content) in the domain of arts in all the languages of the SPICE project (fi, en, es, he, it). The data was collected through a set of Google Forms (one for each language) and was used in the project (along with the other datasets collected by museums in the different project use cases) to train and test Emotion Detection Models.

    The experiment consists of 12 artworks, chosen from a group of artworks provided by the GAM Museum of Turin (https://www.gamtorino.it/), one of the project partners. Each artwork is presented in a different section of the form; for each of the artworks, the user is asked to answer 5 open questions:

    1. What do you see in this picture? Write what strikes you most in this image.

    2. What does this artwork make you think about? Write the thoughts and memories that the picture evokes.

    3. How does this painting make you feel? Write the feelings and emotions that the picture evokes in you

    4. What title would you give to this artwork?

    5. Now choose one or more emoji to associate with your feelings looking at this artwork. You can also select "other" and insert other emojis by copying them from this link: https://emojipedia.org/

    For each of the artworks, the user can decide to skip to the next artwork if they do not like the current one, or to go back to previous artworks and modify their answers. It is not mandatory to answer all the questions for a given artwork.

    The question about emotions is left open so as not to force the person to choose emotions from a fixed list of tags tied to a particular model (e.g. Plutchik's), leaving them free to express the different shades of emotion they feel.

    Before getting to the heart of the experiment with the artwork sections, the user is asked to leave some personal information (anonymously), to help us get an idea of the type of users who participated in the experiment.

    The questions are:

    1. Age (open)

    2. Gender (male, female, prefer not to say, other (open))

    3. How would you define your relationship with art?

    My job is related to the art world

    I am passionate about art

    I am a little interested in art

    I am not interested in art

    4. Do you like going to museums or art exhibitions?

    I like to visit museums frequently

    I go occasionally to museums or art exhibitions

    I rarely visit museums or art exhibitions

    Dataset structure:

    FI.csv: form data (personal data + open questions) in Finnish (UTF-8)

    EN.csv: form data (personal data + open questions) in English (UTF-8)

    ES.csv: form data (personal data + open questions) in Spanish (UTF-8)

    HE.csv: form data (personal data + open questions) in Hebrew (UTF-8)

    IT.csv: form data (personal data + open questions) in Italian (UTF-8)

    artworks.csv: the list of artworks including title, author, picture name (the pictures can be found in pictures.zip) and the mapping between the columns in the form data and the questions about that artwork

    pictures.zip: the jpeg of the artworks

  15. C

    Hospital Annual Financial Data - Selected Data & Pivot Tables

    • data.chhs.ca.gov
    • data.ca.gov
    • +4more
    csv, data, doc, html +4
    Updated Apr 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Department of Health Care Access and Information (2025). Hospital Annual Financial Data - Selected Data & Pivot Tables [Dataset]. https://data.chhs.ca.gov/dataset/hospital-annual-financial-data-selected-data-pivot-tables
    Explore at:
    Available download formats: csv, doc, data, html, pdf, xls, xlsx, zip
    Dataset updated
    Apr 23, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    On an annual basis (individual hospital fiscal year), individual hospitals and hospital systems report detailed facility-level data on services capacity, inpatient/outpatient utilization, patients, revenues and expenses by type and payer, balance sheet and income statement.

    Due to the large size of the complete dataset, a selected set of data representing a wide range of commonly used data items has been created that can be easily managed and downloaded. The selected data file includes general hospital information, utilization data by payer, revenue data by payer, expense data by natural expense category, financial ratios, and labor information.

    There are two groups of data contained in this dataset: 1) Selected Data - Calendar Year: To make it easier to compare hospitals by year, hospital reports with report periods ending within a given calendar year are grouped together. The Pivot Tables for a specific calendar year are also found here. 2) Selected Data - Fiscal Year: Hospital reports with report periods ending within a given fiscal year (July-June) are grouped together.

  16. f

    Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)
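    A tab-separated, header-less row with these ten columns can be parsed with the csv module. A minimal sketch (the sample row is illustrative, not taken from the actual data):

    ```python
    import csv
    import io

    # Column labels as documented for the ".csv" files above.
    COLUMNS = ["isogramy", "length", "word", "source_pos", "count",
               "vol_count", "count_per_million", "vol_count_as_percent",
               "is_palindrome", "is_tautonym"]

    # One illustrative tab-separated row in the described format.
    sample = "1\t8\tdialogue\tNOUN\t1500\t120\t3\t2\t0\t0\n"

    reader = csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS, delimiter="\t")
    row = next(reader)
    print(row["word"], int(row["length"]), bool(int(row["is_palindrome"])))
    ```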

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.

    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.

    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE

    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data-cleaning step. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)

    2. Copy the "create-database.sql" script into the same directory as the two data files.

    3. On the command line, go to the directory where the files and the SQL script are.

    4. Type: sqlite3 isograms.db < create-database.sql

    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
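    The properties the dataset records (order of isogramy, palindrome, tautonym) can be sketched in a few lines of Python. This is a minimal illustration of the definitions, not the repository's own isograms.py:

    ```python
    from collections import Counter

    def isogram_order(word):
        """Return n if every letter occurs exactly n times, else 0.

        A first-order isogram has each letter once ("dialogue"); a
        second-order isogram has each letter exactly twice ("deed").
        """
        counts = Counter(word.lower())
        n = min(counts.values())
        return n if max(counts.values()) == n else 0

    def is_palindrome(word):
        """True if the word reads the same backwards ("level")."""
        w = word.lower()
        return w == w[::-1]

    def is_tautonym(word):
        """True if the word is one string repeated twice ("couscous")."""
        w = word.lower()
        half = len(w) // 2
        return len(w) % 2 == 0 and w[:half] == w[half:]

    print(isogram_order("dialogue"), isogram_order("deed"), is_tautonym("couscous"))
    ```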

  17. R

    Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges

    • public.roboflow.com
    zip
    Updated Sep 30, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Northeastern University - China (2022). Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges [Dataset]. https://public.roboflow.com/object-detection/hard-hat-workers/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Northeastern University - China
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of Workers
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png

    Use Cases

    One could use this dataset to, for example, build a classifier of workers who are abiding by the safety code within a workplace versus those who may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes

    * v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations

    * v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images

    * v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied

    * v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class

    * v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes

    * v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes

    * v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images

    * v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model

    * v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model

    * v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model

    * v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  18. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • paperswithcode.com
    • +5more
    application/rdfxml +5
    Updated Jul 9, 2024
    + more versions
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
    Explore at:
    application/rdfxml, tsv, csv, json, xml, application/rssxml. Available download formats
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC, including demographics, exposure history, disease severity indicators and outcomes, and the presence of any underlying medical conditions and risk behaviors; it contains no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after were requested to be shared with CDC by public health departments using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised; case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
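    The first two cleaning rules above (blank answers recoded to Missing, future onset dates set to null) can be sketched as follows. The field names and report date are illustrative assumptions, not CDC's actual schema:

    ```python
    from datetime import date

    REPORT_DATE = date(2023, 5, 11)  # hypothetical date the report batch was received

    def clean_record(record):
        cleaned = dict(record)
        # Rule 1: a blank case-report answer is reclassified to "Missing".
        if cleaned.get("hospitalized", "") == "":
            cleaned["hospitalized"] = "Missing"
        # Rule 2: a symptom onset date in the future is set to null (None)
        # until the reporting jurisdiction corrects it.
        onset = cleaned.get("onset_date")
        if onset is not None and onset > REPORT_DATE:
            cleaned["onset_date"] = None
        return cleaned

    rec = {"hospitalized": "", "onset_date": date(2024, 1, 1)}
    print(clean_record(rec))  # {'hospitalized': 'Missing', 'onset_date': None}
    ```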

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
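    The low-frequency suppression rule can be sketched as follows; the grouping keys mirror the demographic characteristics mentioned above, but the record layout is an illustrative assumption:

    ```python
    from collections import Counter

    SUPPRESSION_THRESHOLD = 5  # cells with counts below 5 are suppressed

    def suppress(records, keys=("sex", "age_group", "race_ethnicity")):
        # Count how often each demographic combination occurs.
        combo_counts = Counter(tuple(r[k] for k in keys) for r in records)
        out = []
        for r in records:
            combo = tuple(r[k] for k in keys)
            if combo_counts[combo] < SUPPRESSION_THRESHOLD:
                # Rare combination: recode the identifying fields to "NA";
                # the record itself is kept, never removed.
                r = {**r, **{k: "NA" for k in keys}}
            out.append(r)
        return out

    cohort = ([{"sex": "F", "age_group": "30-39", "race_ethnicity": "White, NH"}] * 5
              + [{"sex": "M", "age_group": "80+", "race_ethnicity": "Asian, NH"}])
    result = suppress(cohort)
    print(result[-1]["sex"], result[0]["sex"])  # NA F
    ```

    Note that suppression recodes values rather than dropping rows, so overall case counts are preserved.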

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These

  19. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets each customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting itemsets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat: - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(computer mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
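    The arithmetic can be verified directly. A minimal Python sketch (note that confidence conditions on the antecedent, the mouse purchase, while lift divides by the consequent's base rate):

    ```python
    n_customers = 100
    n_mouse = 10  # bought a computer mouse
    n_mat = 9     # bought a mouse mat
    n_both = 8    # bought both

    support = n_both / n_customers                  # P(mouse & mat) = 0.08
    confidence = support / (n_mouse / n_customers)  # P(mat | mouse) = 0.08 / 0.10 = 0.8
    lift = confidence / (n_mat / n_customers)       # 0.8 / 0.09 ≈ 8.9

    print(round(support, 2), round(confidence, 2), round(lift, 2))  # 0.08 0.8 8.89
    ```

    A lift well above 1 indicates that buying a mouse makes buying a mat much more likely than its base rate alone would suggest.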

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load the required libraries. Below, I briefly describe each one.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean the data frame by removing rows with missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
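    The grouping step described here can be sketched in plain Python for illustration; the post's actual workflow uses R's arules package, and the rows below are made up to mimic the BillNo/Itemname columns:

    ```python
    from collections import defaultdict

    # Hypothetical rows mimicking the BillNo / Itemname columns of the dataset.
    rows = [
        {"BillNo": "536365", "Itemname": "WHITE HANGING HEART T-LIGHT HOLDER"},
        {"BillNo": "536365", "Itemname": "WHITE METAL LANTERN"},
        {"BillNo": "536366", "Itemname": "HAND WARMER UNION JACK"},
    ]

    def to_transactions(rows):
        # One basket per invoice: every item bought on the same bill.
        baskets = defaultdict(set)
        for r in rows:
            baskets[r["BillNo"]].add(r["Itemname"])
        return dict(baskets)

    transactions = to_transactions(rows)
    print(len(transactions))  # 2
    ```

    Each basket is then what an association rule miner consumes: a set of items purchased together, with the invoice number as the transaction ID.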

  20. CASAS Smart Home dataset - scripted activities, with and without activity...

    • zenodo.org
    png, zip
    Updated Jun 21, 2025
    Cite
    Diane Cook; Diane Cook (2025). CASAS Smart Home dataset - scripted activities, with and without activity errors [Dataset]. http://doi.org/10.5281/zenodo.15712834
    Explore at:
    zip, png. Available download formats
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diane Cook; Diane Cook
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Measurement technique
    Citation: Please cite the following paper when using this dataset: Cook, D. & Schmitter-Edgecombe, M. (2009). Assessing the quality of activities in a smart environment. Methods of Information in Medicine, 48(5):480-485. https://doi.org/10.3414/ME0592
    Description

    These two datasets contain sensor events collected in the CASAS smart apartment testbed at Washington State University. In both datasets, ambient sensor readings were collected while 20 participants performed five ADL (activities of daily living) tasks in the apartment. This resource is valuable for designing and validating activity recognition algorithms. It also provides data for detecting task errors, which is helpful for assessing functional independence and planning interventions.

    In the adl_noerror dataset, the five tasks are:

    1. Make a phone call. The participant moves to the phone in the dining room, looks up a specific number in the phone book, dials the number, and listens to the message. The recorded message provides cooking directions, which the participant summarizes on a notepad.
    2. Wash hands. The participant moves to the kitchen sink and washes their hands in the sink, using hand soap and drying them with a paper towel.
    3. Cook. The participant cooks a pot of oatmeal according to the directions given in the phone message. To cook the oatmeal the participant must measure water, pour the water into a pot and boil it, add oats, then put the oatmeal into a bowl with raisins and brown sugar.
    4. Eat. The participant takes the oatmeal and a medicine container to the dining room and eats the food.
    5. Clean. The participant takes all of the dishes to the sink and cleans them with water and dish soap in the kitchen.

    In the adl_error dataset, a scripted error is introduced. The errors are:

    1. Make a phone call. Error: The participant initially dials the wrong number and has to redial.
    2. Wash hands. Error: The participant does not turn the water off after washing his/her hands.
    3. Cook. Error: The participant does not turn the burner off after cooking the oatmeal.
    4. Eat. Error: The participant does not bring the medicine container with them to the dining room.
    5. Clean. Error: The participant does not use water to clean the dishes.

    The files are named according to the participant number and task number (e.g., p01.t1.csv contains sensor data for participant 1 performing task 1). There is one sensor reading in each row with fields date, time, sensor, and message.

    A floorplan of the smart apartment is provided in Chinook.png, together with the locations of the sensors. A zoomed-in look at the Chinook cabinet with sensors is provided in Chinook_Cabinet.png. The sensors are categorized (and named) as:

    • M01 - M026: PIR motion detectors (ON when detected motion starts and OFF when it stops)
    • I01 - I08: item use sensors for (in order) oatmeal, raisins, brown sugar, bowl, measuring spoon, medicine container, pot, phone book (PRESENT or ABSENT indicating item is on sensor or not)
    • D01: door sensor on kitchen cabinet (OPEN or CLOSE)
    • AD1-A and AD1-B: water sensors for kitchen sink (value indicates level)
    • AD1-C: burner sensor (value indicates level)
    • asterisk: phone use sensor
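    Each file row holds date, time, sensor, and message fields, so the standard csv module is enough to parse it; the readings below are made up to illustrate the layout:

    ```python
    import csv
    import io

    # Made-up readings in the p01.t1.csv layout: date, time, sensor, message.
    sample = """\
    2008-02-27,12:43:27.42,M08,ON
    2008-02-27,12:43:29.01,M08,OFF
    2008-02-27,12:43:31.77,I08,ABSENT
    """

    def load_events(f):
        fields = ("date", "time", "sensor", "message")
        return [dict(zip(fields, row)) for row in csv.reader(f)]

    events = load_events(io.StringIO(sample))
    # Motion-detector activations: sensors named M..., message ON.
    motion_on = [e for e in events
                 if e["sensor"].startswith("M") and e["message"] == "ON"]
    print(len(events), len(motion_on))  # 3 1
    ```

    Filtering on sensor name prefixes (M for motion, I for item, D for door) is a simple way to separate sensor modalities when building activity recognition features.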