Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We use the Enron email dataset to build a network of email addresses. It contains 614,586 emails sent between 6 January 1998 and 4 February 2004. During pre-processing, we remove the periods of low activity and keep the emails from 1 January 1999 until 31 July 2002, a total of 1,448 days of email records. We also remove email addresses that sent fewer than three emails over that period. In total, the Enron email network contains 6,600 nodes and 50,897 edges.
To build a graph G = (V, E), we use email addresses as the nodes V. Every node v_i has an attribute, a time-varying signal that corresponds to the number of emails sent from that address each day. We draw an edge e_ij between two nodes i and j if there is at least one email exchange between the corresponding addresses.
The 'Count' column in the 'edges.csv' file gives the number of 'From'->'To' email exchanges between the two addresses and can be used as an edge weight.
The file 'nodes.csv' contains a dictionary that is a compressed representation of each node's time series, mapping Day -> Number of emails sent by the address during that day. The total number of days is 1448.
'id-email.csv' is a file containing the actual email addresses.
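As a minimal sketch of the graph construction described above (the two data rows are invented placeholders standing in for the real 'edges.csv', which has 'From', 'To' and 'Count' columns):

```python
# Build a weighted, undirected adjacency map from edges.csv-style data.
import csv
import io

edges_csv = (
    "From,To,Count\n"
    "alice@enron.com,bob@enron.com,5\n"
    "bob@enron.com,carol@enron.com,2\n"
)  # stand-in for open("edges.csv")

graph = {}  # node -> {neighbour: weight}
for row in csv.DictReader(io.StringIO(edges_csv)):
    weight = int(row["Count"])  # 'Count' serves as the edge weight
    graph.setdefault(row["From"], {})[row["To"]] = weight
    graph.setdefault(row["To"], {})[row["From"]] = weight  # undirected edges

# nodes, edges (each undirected edge stored twice, hence the division)
print(len(graph), sum(len(nbrs) for nbrs in graph.values()) // 2)
```

With the full file this reproduces the node and edge counts quoted above; the same structure can be handed to a graph library such as networkx if needed.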
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The total number of user mailboxes in Umeå kommun and how many are active each day of the reporting period. A mailbox is considered active if the user sent or read any email.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CNN Dailymail Dataset
Dataset Summary
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
Supported Tasks and Leaderboards
'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
There are lots of really cool datasets getting added to Kaggle every day, and as part of my job I want to help people find them. I’ve been tweeting about datasets on my personal Twitter account, @rctatman, and also releasing a weekly newsletter of interesting datasets.
I wanted to know which method was more effective at getting the word out about new datasets: Twitter or the newsletter?
This dataset contains two .csv files. One has information on the impact of tweets with links to datasets, while the other has information on the impact of the newsletter.
Twitter:
The Twitter .csv has the following information:
Fridata Newsletter:
The Fridata .csv has the following information:
This dataset was collected by the uploader, Rachael Tatman. It is released here under a CC-BY-SA license.
Most organizations today rely on email campaigns for effective communication with users. Email is one of the popular ways to pitch products to users and build trustworthy relationships with them. Email campaigns contain different types of CTA (Call To Action), and their ultimate goal is to maximize the Click Through Rate (CTR): CTR = No. of users who clicked on at least one of the CTAs / No. of emails delivered. This dataset contains details such as body length, subject length, mean paragraph length, day of week, whether the email was sent on a weekend, etc.
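The CTR definition above can be written as a small helper (illustrative only; the zero-delivery guard is an added assumption):

```python
# CTR = users who clicked at least one CTA / emails delivered.
def click_through_rate(users_clicked: int, emails_delivered: int) -> float:
    """Fraction of delivered emails whose recipient clicked at least one CTA."""
    if emails_delivered == 0:  # guard against an empty campaign
        return 0.0
    return users_clicked / emails_delivered

print(click_through_rate(120, 4000))  # → 0.03
```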
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed an application and solution approach (using this dataset) for automatically generating and suggesting short email responses to support queries in a university environment. The proposed solution can be used as a one-tap or one-click way of responding to the various types of queries raised by faculty members and students. The Office of Academic Affairs (OAA), the Office of Student Life (OSL) and the Information Technology Helpdesk (ITD) are support functions within a university that receive hundreds of email messages on a daily basis. Email is still the most frequently used mode of communication with these departments. A large percentage of the emails they receive are frequent, commonly repeated queries or requests for information. Responding to every query by manually typing is a tedious and time-consuming task, and a large percentage of the emails and their responses consist of short messages. For example, the IT support department in our university receives several emails about Wi-Fi not working, or from someone needing help with a projector, or requiring an HDMI cable or remote slide changer. Another example is emails from students requesting that the office of academic affairs add or drop courses, which they cannot do directly. The dataset consists of email messages of the kind generally received by ITD, OAA and OSL at Ashoka University. It also contains intermediate results from the machine learning experiments we conducted.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
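As an illustration of the extractive use case mentioned above, a lead-3 baseline (taking the first three sentences of the article as the summary, a common reference point for this dataset) can be sketched as follows; the example record is invented:

```python
# Lead-3 extractive baseline: first three sentences of the article.
import re

def lead_3(article: str) -> str:
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:3])

record = {  # invented stand-in for one article/highlights pair
    "article": "A storm hit the coast on Monday. Thousands lost power. "
               "Crews worked overnight. Schools reopened on Wednesday.",
    "highlights": "Storm knocks out power for thousands.",
}
print(lead_3(record["article"]))
```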
https://creativecommons.org/publicdomain/zero/1.0/
The dataset is about the telecom industry and records which customers churned the service. It consists of 3,333 observations with 21 variables. The task is to predict which customers are going to churn.
Account.Length: How long the account has been active.
VMail.Message: Number of voice mail messages sent by the customer.
Day.Mins: Time spent on day calls.
Eve.Mins: Time spent on evening calls.
Night.Mins: Time spent on night calls.
Intl.Mins: Time spent on international calls.
Day.Calls: Number of day calls by the customer.
Eve.Calls: Number of evening calls by the customer.
Intl.Calls: Number of international calls.
Night.Calls: Number of night calls by the customer.
Day.Charge: Charges for day calls.
Night.Charge: Charges for night calls.
Eve.Charge: Charges for evening calls.
Intl.Charge: Charges for international calls.
VMail.Plan: Whether the customer has a voice mail plan.
State: State in the area of study.
Phone: Phone number of the customer.
Area.Code: Area code of the customer.
Int.l.Plan: Whether the customer has an international plan.
CustServ.Calls: Number of customer service calls made by the customer.
Churn: Whether the customer churned the telecom service (0 = “Churner”, 1 = “Non-Churner”).
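A minimal churn-classification sketch in the spirit of the prediction task above; the rows are invented toy data (with the real file you would read the 21 columns from CSV and encode categorical fields such as State and the plan flags first):

```python
from sklearn.linear_model import LogisticRegression

# Toy feature columns: Day.Mins, CustServ.Calls, Int.l.Plan (0/1).
X = [
    [180.0, 1, 0],
    [220.5, 4, 1],
    [110.2, 0, 0],
    [265.0, 5, 1],
    [150.7, 1, 0],
    [240.3, 3, 1],
]
# Target uses the card's coding: 0 = "Churner", 1 = "Non-Churner".
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)
print(model.predict([[200.0, 2, 0]]))  # predicted class for a new customer
```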
DomainIQ is a comprehensive global Domain Name dataset for organizations that want to build cyber security, data cleaning and email marketing applications. The dataset consists of the DNS records for over 267 million domains, updated daily, representing more than 90% of all public domains in the world.
The data is enriched with over thirty unique data points, including the mailbox provider for each domain, and uses AI-based predictive analytics to identify elevated-risk domains from both a cyber security and an email-sending reputation perspective.
DomainIQ from Datazag offers layered intelligence through a highly flexible API and as a dataset, available for both cloud and on-premises applications. Standard formats include CSV, JSON, Parquet, and DuckDB.
Custom options are available for any other file or database format. With daily updates and constant research from Datazag, organizations can develop their own market-leading cyber security, data cleaning and email validation applications supported by comprehensive and accurate data. Data updates are available on a daily, weekly and monthly basis; API data is updated daily.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains images of various plastic objects commonly found in everyday life. Each image is annotated with bounding boxes around the plastic items, allowing for object detection tasks in computer vision applications. With a diverse range of items such as milk packets, ketchup pouches, pens, plastic bottles, polythene bags, shampoo bottles and pouches, chips packets, cleaning spray bottles, handwash bottles, and more, this dataset offers rich training material for developing object detection models.
The dataset is an extremely challenging set of over 4,000 original plastic-object images captured and crowdsourced from over 1,000 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.
Optimized for Generative AI, Visual Question Answering, Image Classification, and LMM development, this dataset provides a strong basis for achieving robust model performance.
Annotation formats: COCO, YOLO, PASCAL-VOC, TFRecord.
The images in this dataset are exclusively owned by Data Cluster Labs and were not downloaded from the internet. To access a larger portion of the training dataset for research and commercial purposes, a license can be purchased. Contact us at sales@datacluster.ai or visit www.datacluster.ai to learn more.
PLEASE NOTE: This dataset, which includes all TLC Licensed Drivers who are in good standing and able to drive, is updated every day in the evening between 4 and 7pm. Please check the 'Last Update Date' field to make sure the list has updated successfully; it should show either today's or yesterday's date, depending on the time of day. If the list is outdated, please download the most recent list from the link below. http://www1.nyc.gov/assets/tlc/downloads/datasets/tlc_medallion_drivers_active.csv This is a list of drivers with a current TLC Driver License, which authorizes them to operate NYC TLC licensed yellow and green taxicabs and for-hire vehicles (FHVs). The list is accurate as of the date and time shown in the Last Date Updated and Last Time Updated fields. Questions about the contents of this dataset can be sent by email to: licensinginquiries@tlc.nyc.gov.
https://www.usa.gov/government-works
This dataset describes the current state of mail ballot requests for the 2025 Municipal Primary Election. It’s a snapshot in time of the current volume of ballot requests across the Commonwealth. The file contains all mail ballot requests except ballot applications that are declined as duplicate.
This point-in-time transactional data is being published for informational purposes to provide detailed data pertaining to the processing of absentee and mail-in ballots by county election offices. This data is extracted once per day from the Statewide Uniform Registry of Electors (SURE system), and it reflects activity recorded by the counties in the SURE system at the time of the data extraction.
Please note that county election offices will continue to process ballot applications (as applicable), record ballots, reconcile ballot data, and make corrections when necessary, and this will continue through, and even after, Election Day. Administrative practices for recording transactions in the system will vary by county. For example, some counties record individual transactions as they occur, while others record transactions in batches at specific intervals. These activities may result in substantial changes to a county's reported data from one day to the next. County practices also differ on when cancelled ballot data is entered into the database (i.e., before or after the election). Some counties do not enter cancelled ballot data entirely.
Additional notes specific to this dataset:
• Counties can enter cancellation codes without entering a ballot returned date.
• Some cancellation codes are a result of administrative processes, meaning the ballot was never mailed to the voter before it was cancelled (e.g., there was an error when the label was printed).
• Confidential and protected voters are not included in this file.
• Counties can only enter one cancel code per ballot, even if there are multiple errors. Different counties may vary in which code they choose to use when this arises, or they may use the catch-all category of 'CANC - OTHER'.
Type of data included in this file: This data includes all mail ballot applications processed by counties, including voters on the permanent mail-in and absentee ballot lists. Multiple rows in this data may correspond to the same voter if they submitted more than one application or had one or more cancelled ballots. A deidentified voter ID has been provided to allow data users to identify when rows correspond to the same voter. This ID is randomized and cannot be used to match to SURE, the Full Voter Export, or previous iterations of the Statewide Mail Ballot File. All application types in this file are considered a type of mail ballot. Some of the applications are considered UOCAVA (Uniformed and Overseas Citizens Absentee Voting Act) or UMOVA (Uniform Military and Overseas Voters Act) ballots. These are listed below:
• CRI - Civilian - Remote/Isolated
• CVO - Civilian Overseas
• F - Federal (Unregistered)
• M - Military
• MRI - Military - Remote/Isolated
• V - Veteran
• BV - Bedridden Veteran
• BVRI - Bedridden Veteran - Remote/Isolated
*We may not have all application types in the file for every election.
M2SDNXSLV (or statD_2d_slv_Nx) is a 2-dimensional daily data collection in the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2). This collection consists of daily statistics, such as the daily mean (or daily minimum and maximum) air temperature at 2 meters, and the maximum precipitation rate during the period. MERRA-2 is the latest version of the global atmospheric reanalysis for the satellite era produced by the NASA Global Modeling and Assimilation Office (GMAO) using the Goddard Earth Observing System Model (GEOS) version 5.12.4. The dataset covers the period from 1980 to present with a latency of ~3 weeks after the end of a month.
Data Reprocessing: Please check “Records of MERRA-2 Data Reprocessing and Service Changes” linked from the “Documentation” tab on this page. Note that a reprocessed data filename is different from the original file.
MERRA-2 Mailing List: Sign up to receive information on reprocessing of data, changes to tools and services, as well as data announcements from GMAO. Contact the GES DISC Help Desk (gsfc-dl-help-disc@mail.nasa.gov) to be added to the list.
Questions: If you have a question, please read the “MERRA-2 File Specification Document”, “MERRA-2 Data Access – Quick Start Guide”, and FAQs linked from the “Documentation” tab on this page. If that does not answer your question, you may post your question to the NASA Earthdata Forum (forum.earthdata.nasa.gov) or email the GES DISC Help Desk (gsfc-dl-help-disc@mail.nasa.gov).
List of the data tables as part of the Immigration system statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending June 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending June 2025 (ODS, 31.3 KB): https://assets.publishing.service.gov.uk/media/689efececc5ef8b4c5fc448c/passenger-arrivals-summary-jun-2025-tables.ods
‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending June 2025 (MS Excel Spreadsheet, 57.1 KB): https://assets.publishing.service.gov.uk/media/689efd8307f2cc15c93572d8/electronic-travel-authorisation-datasets-jun-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending June 2025 (ODS, 56.1 KB): https://assets.publishing.service.gov.uk/media/68b08043b430435c669c17a2/visas-summary-jun-2025-tables.ods
Entry clearance visa applications and outcomes detailed datasets, year ending June 2025 (MS Excel Spreadsheet, 29.6 MB): https://assets.publishing.service.gov.uk/media/689efda51fedc616bb133a38/entry-clearance-visa-outcomes-datasets-jun-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional data relating to in-country and overseas visa applications can be fo
The sonic data within the building array comprise 26 days of 30-minute averaged data from 30 sonic anemometers. The unobstructed tower sonic data cover the same period, but for the 5 heights of the tower. The data files have 48 columns with date and time identifiers as well as meteorological turbulence measurements. This dataset is not publicly accessible because the data were not collected by EPA and are hosted external to the agency. It can be accessed through the following means: the detailed sonic dataset is freely available to others wishing to perform additional analysis; however, it is large and not readily posted. The complete dataset is included in the comprehensive JR II data archive set up by the DHS Science and Technology (S&T) Directorate, Chemical Security Analysis Center (CSAC). To obtain the data, an email request can be sent to JackRabbit@st.dhs.gov. The user can then access the archive on the Homeland Security Information Network (HSIN). Format: the sonic data within the Jack Rabbit II (JRII) mock-urban building array are in 30-minute averaged daily Excel files, separated by sonic anemometer, with numerous variables. The unobstructed, raw 10 Hz tower data are in .dat files and are processed into 30-minute averaged daily csv files by sonic height.
This dataset is associated with the following publication: Pirhalla, M., D. Heist, S. Perry, S. Hanna, T. Mazzola, S.P. Arya, and V. Aneja. Urban Wind Field Analysis from the Jack Rabbit II Special Sonic Anemometer Study. ATMOSPHERIC ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 243: 14, (2020).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels.
SAPFLUXNET is a harmonised database, compiled from contributions by researchers worldwide. This version (0.1.4) contains more than 200 datasets from all over the world, covering a broad range of bioclimatic conditions.
More information on the coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researchers' Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
# remotes::install_github(
# 'sapfluxnet/sapfluxnetr',
# build_opts = c("--no-resave-data", "--no-manual", "--build-vignettes")
# )
library(sapfluxnetr)
# to list all vignettes
vignette(package='sapfluxnetr')
# variables and units
vignette('metadata-and-data-units', package='sapfluxnetr')
# data flags
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to:
- Work with data from a single site: data access, plotting and time aggregation.
- Select the subset datasets to work with.
- Work with data from multiple sites: data access, plotting and time aggregation.
Please check the package vignettes to learn more about how to work with sfn_data files.
Working with text files
We recommend working with sfn_data objects using R and the sapfluxnetr package; we do not currently provide code to work with text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Temporary data fixes that have been detected but not yet included in released versions will be published on the SAPFLUXNET main web page ('Known data errors').
Data access, use and citation
This version of the SAPFLUXNET database is open access. We are working on a data paper describing the database, but, before its publication, please cite this Zenodo entry if SAPFLUXNET is used in any publication.
The Reminder extension for CKAN enhances data management by providing automated email notifications based on dataset expiry dates and update subscriptions. Designed to work with CKAN versions 2.2 and up, but tested on 2.5.2, this extension offers a straightforward mechanism for keeping users informed about dataset updates and expirations, promoting better data governance and engagement. The extension leverages a daily cron job to check expiry dates and trigger emails.
Key Features:
- Data Expiry Notifications: Sends email notifications when datasets reach their specified expiry date. A daily cronjob determines when to send these emails; note that failure of the cronjob will prevent email delivery for that day.
- Dataset Update Subscriptions: Allows users to subscribe to specific datasets and receive notifications upon updates, via a subscription form snippet that can be included in dataset templates.
- Unsubscribe Functionality: Includes an unsubscribe link in each notification email, enabling users to easily manage their subscriptions.
- Configuration Settings: Supports at least one recipient for reminder emails via settings in the CKAN config file.
- Bootstrap Styling: Intended for use with Bootstrap 3+, but may still work with Bootstrap 2 with potential style inconsistencies.
Technical Integration: The Reminder extension integrates into CKAN via plugins, necessitating the addition of reminder to the ckan.plugins setting in the CKAN configuration file. The extension requires database initialization using paster commands to support the subscription functionality, and setting up a daily cronjob is necessary for the automated sending of reminder and notification emails.
Benefits & Impact: By implementing the Reminder extension, CKAN installations can improve data management and user engagement.
Automated notifications ensure that stakeholders are aware of dataset expirations and updates, leading to better data governance, and more active user involvement in data ecosystems. This extension provides an easy-to-implement solution for managing data lifecycles and keeping users informed.
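Per the integration notes above, enabling the plugin is a one-line change to the CKAN configuration file; a sketch (the other plugin names shown are illustrative, keep whatever your instance already uses):

```ini
; ckan.ini (sketch): add 'reminder' to the enabled plugins, as described above.
ckan.plugins = stats text_view reminder
```

A daily cron job must also invoke the extension's paster command so the reminder and notification emails actually go out; see the extension's documentation for the exact command.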
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a collection of aggregated clinical parameters for the participants (such as clinical scores), parameters extracted from the utilized devices (such as average heart rate per day, average gait speed, etc.), and coupled events about them (such as falls, loss of orientation, etc.). It contains information collected during the clinical evaluation of the older people by medical experts. This information represents the clinical status of the older person across different domains, e.g. physical, psychological, cognitive, etc.
The dataset contains several medical features which are used by clinicians to assess the overall state of the older people.
The purpose of the Virtual Patient Model is to assess the overall state of the older people based on their medical parameters, and to find associations between these parameters and frailty status.
A list of the recorded clinical parameters and their description is shown below:
- part_id: The user ID, which should be a 4-digit number
- q_date: The recording timestamp, which follows the “YYYY-MM-DDTHH:mm:ss.fffZ” format (e.g. 14 September 2019 12:23:34.567 is formatted as 2019-09-14T12:23:34.567Z)
- clinical_visit: As several clinical evaluations were performed for each older adult, this number indicates which clinical evaluation these measurements refer to
- fried: Ordinal categorization of frailty level according to Fried operational definition of frailty
- hospitalization_one_year: Number of nonscheduled hospitalizations in the last year
- hospitalization_three_years: Number of nonscheduled hospitalizations in the last three years
- ortho_hypotension: Presence of orthostatic hypotension
- vision: Visual difficulty (qualitative ordinal evaluation)
- audition: Hearing difficulty (qualitative ordinal evaluation)
- weight_loss: Unintentional weight loss >4.5 kg in the past year (categorical answer)
- exhaustion_score: Self-reported exhaustion (categorical answer)
- raise_chair_time: Time in seconds to perform a lower limb strength clinical test
- balance_single: Single foot station (Balance) (categorical answer)
- gait_get_up: Time in seconds to perform the 3-meter Timed Up and Go test
- gait_speed_4m: Speed for a 4-meter straight walk
- gait_optional_binary: Gait optional evaluation (qualitative evaluation by the investigator)
- gait_speed_slower: Slowed walking speed (categorical answer)
- grip_strength_abnormal: Grip strength outside the norms (categorical answer)
- low_physical_activity: Low physical activity (categorical answer)
- falls_one_year: Number of falls in the last year
- fractures_three_years: Number of fractures during the last 3 years
- fried_clinician: Fried’s categorization according to the clinician’s estimation (used when data for answering Fried’s operational frailty definition questionnaire are missing)
- bmi_score: Body Mass Index (in Kg/m²)
- bmi_body_fat: Body Fat (%)
- waist: Waist circumference (in cm)
- lean_body_mass: Lean Body Mass (%)
- screening_score: Mini Nutritional Assessment (MNA) screening score
- cognitive_total_score: Montreal Cognitive Assessment (MoCA) test score
- memory_complain: Memory complaint (categorical answer)
- mmse_total_score: Folstein Mini-Mental State Exam score
- sleep: Reported sleeping problems (qualitative ordinal evaluation)
- depression_total_score: 15-item Geriatric Depression Scale (GDS-15)
- anxiety_perception: Anxiety auto-evaluation (visual analogue scale 0-10)
- living_alone: Living Conditions (categorical answer)
- leisure_out: Leisure activities (number of leisure activities per week)
- leisure_club: Membership of a club (categorical answer)
- social_visits: Number of visits and social interactions per week
- social_calls: Number of telephone calls exchanged per week
- social_phone: Approximate time spent on phone per week
- social_skype: Approximate time spent on videoconference per week
- social_text: Number of written messages (SMS and emails) sent by the participant per week
- house_suitable_participant: Subjective suitability of the housing environment according to participant’s evaluation (categorical answer)
- house_suitable_professional: Subjective suitability of the housing environment according to investigator’s evaluation (categorical answer)
- stairs_number: Number of steps to access house (without possibility to use elevator)
- life_quality: Quality of life self-rating (visual analogue scale 0-10)
- health_rate: Self-rated health status (qualitative ordinal evaluation)
- health_rate_comparison: Self-assessed change since last year (qualitative ordinal evaluation)
- pain_perception: Self-rated pain (visual analogue scale 0-10)
- activity_regular: Regular physical activity (ordinal answer)
- smoking: Smoking (categorical answer)
- alcohol_units: Alcohol Use (average alcohol units consumption per week)
- katz_index: Katz Index of ADL score
- iadl_grade: Instrumental Activities of Daily Living score
- comorbidities_count: Number of comorbidities
- comorbidities_significant_count: Number of comorbidities which significantly affect the person’s functional status
- medication_count: Number of active substances taken on a regular basis
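The q_date format described above is ISO 8601 and can be parsed with the standard library; the sample timestamp is taken from the field description (note that strptime's %z accepts the trailing 'Z' on Python 3.7+):

```python
# Parse the q_date timestamp format "YYYY-MM-DDTHH:mm:ss.fffZ".
from datetime import datetime, timezone

raw = "2019-09-14T12:23:34.567Z"
ts = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")  # 'Z' -> UTC

print(ts.isoformat(), ts.tzinfo == timezone.utc)
```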
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CEAS-08 Email Phishing Detection Instruction Dataset
This dataset contains instruction-following conversations for email phishing detection, generated from the CEAS-08 email dataset using multiple large language models. It's designed for fine-tuning conversational AI models on cybersecurity tasks.
Dataset Details
Dataset Description
This dataset transforms raw email data into structured instruction-following conversations where an AI security analyst analyzes… See the full description on the dataset page: https://huggingface.co/datasets/luongnv89/phishing-email.
https://www.icpsr.umich.edu/web/ICPSR/studies/8412/terms
The primary purpose of this survey was to develop a description of the United States household mailstream for the United States Postal Service (USPS) and to provide annualized, nationwide estimates of the volume of mail received and sent by households in the United States. To this end, the survey gathered information on the characteristics of every USPS letter and package that was sent or received by each sampled household on every day of a preassigned week in the survey period. Daily accounts of items not handled by the USPS were also gathered, e.g., United Parcel Service, telegrams, long-distance telephone calls, newspapers, magazines, advertisements, free samples, campaign literature, and utility bills. In addition to providing mailstream information, respondents answered questions pertaining to their mail delivery and mailing practices, their knowledge of mail and other means of communications, and their opinions on both the performance of the USPS and on proposed changes in mail service and rates. They also supplied information on any stamp collectors living in their household, the age and sex of the collectors, the kinds of stamps they collected, and their expenditures on United States commemorative stamps and corner stamps from sheets of new USPS issues. The dataset includes data on the location of the household, length of residence in the current dwelling unit, family income, the age of each household member, and the age, sex, race, education, occupation, and employment status of the respondent and the head of household.