https://academictorrents.com/nolicensespecified
To quote the data source: "This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in som
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In email communication, messages can be sent to multiple recipients. In this dataset, nodes are email addresses at Enron, and a hyperedge comprises the sender and all recipients of an email. Only email addresses from a core set of employees are included. Timestamps are in ISO 8601 format.
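The hyperedge definition above can be sketched in a few lines of Python. This is a minimal illustration only: the `Hyperedge` class, the `make_hyperedge` helper, and the two example records are hypothetical, and the on-disk format of the dataset is not described here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hyperedge:
    """One email: the sender plus all recipients, with its send time."""
    timestamp: datetime
    nodes: frozenset


def make_hyperedge(timestamp_iso, sender, recipients):
    # The node set merges sender and recipients; duplicates collapse.
    return Hyperedge(datetime.fromisoformat(timestamp_iso),
                     frozenset([sender, *recipients]))


# Hypothetical example records (not taken from the dataset).
edges = [
    make_hyperedge("2001-05-14T09:30:00", "a@enron.com",
                   ["b@enron.com", "c@enron.com"]),
    make_hyperedge("2000-11-02T16:05:00", "b@enron.com", ["a@enron.com"]),
]

# A temporal hypergraph is the sequence of hyperedges ordered by time.
edges.sort(key=lambda edge: edge.timestamp)
sizes = [len(edge.nodes) for edge in edges]
```

Using a `frozenset` for the nodes reflects that a hyperedge is an unordered set, so an address that appears as both sender and recipient is counted once.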
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation.
The email dataset was later purchased by Leslie Kaelbling at MIT and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., the recipient is specified in some parseable format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
Some basic statistics of this dataset are:
Component Size, Number
Source: email-Enron dataset
If you use this dataset, please cite these references:
This dataset contains the data from roughly two years of operating PrivacyMail.info, an open-source email privacy measurement platform. It contains slightly over 500,000 commercial newsletters, as crowdsourced by users of PrivacyMail.info. You can find the methodology discussed in our paper: Max Maass, Stephan Schwär, and Matthias Hollick. "Towards transparency in email tracking." Annual Privacy Forum, 2019. The source code can be found on github.com/privacymail/privacymail.
Please note that, due to its crowdsourced nature, this dataset is a sample of opportunity: it is not representative of all newsletters on the Internet, and likely contains biases based on how it was collected. Notably, German-language newsletters will likely be heavily over-represented.
Dataset Structure
The dataset is structured as follows: On the top level are folders describing the website the newsletter belongs to. Inside that folder are subfolders for each identity that was registered for that website. Inside each of these folders are a series of .eml files that represent the received email messages.
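As a rough sketch of how this layout could be traversed, the following Python walks the three folder levels and parses each message with the standard library. The function name `iter_messages` and the exact path pattern `<website>/<identity>/<file>.eml` are assumptions based on the description above, not part of the dataset documentation.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def iter_messages(root):
    """Yield (website, identity, message) for every .eml file under root.

    Assumes the layout root/<website>/<identity>/<name>.eml.
    """
    for eml_path in sorted(Path(root).glob("*/*/*.eml")):
        website = eml_path.parent.parent.name
        identity = eml_path.parent.name
        with eml_path.open("rb") as handle:
            message = BytesParser(policy=policy.default).parse(handle)
        yield website, identity, message
```

A caller could then iterate with `for website, identity, message in iter_messages("path/to/dataset")` and read headers such as `message["Subject"]`.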
Copyright and Licensing
This dataset is set to non-public due to copyright concerns: The contents of the email messages are (presumably) protected by copyright in most jurisdictions. Most copyright doctrines contain exceptions for non-commercial research use - thus, we feel it is appropriate and acceptable to share the data on a case-by-case basis, the same way we did before shutting down PrivacyMail.info. When requesting access to the data, please briefly describe what research you want to conduct with it, and we will grant you access.
We thus do not put any explicit license on this dataset. Please do not share the raw data publicly. We request that you cite the above-mentioned paper and this dataset in any publications that result from it.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset consists of a curated and anonymized collection of real job application confirmation emails from a Gmail inbox. It includes confirmation emails, rejection notices, and other relevant correspondences. The dataset was originally curated to address the challenge of eliminating manual job application tracking, allowing for automatic tracking directly from the inbox, capturing application confirmations and rejection notifications.
The dataset has been carefully pre-processed, cleaned, and enriched with derived features such as:
The dataset was originally curated to build a job application tracking agent that can automatically extract and track application updates, such as confirmations, rejections, interview invites, and assessment notifications, directly from the inbox. The goal was to enable users to easily interact with an AI assistant to analyze and manage their job search process more efficiently.
⚠️ Disclaimer: All personally identifiable information (PII), such as names and email addresses, has been fully anonymized or redacted. This dataset is intended strictly for educational and research purposes. Any personal names found in the dataset have been replaced with the fictional name "Michael Gary Scott" as a placeholder. This character reference is used purely for fun and does not correspond to any real individual. Please ensure any further use of this dataset respects privacy and ethical data handling practices.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains anonymised survey responses from a comprehensive study conducted to explore current email management practices among users. The survey aimed to gain insights into how individuals handle and organize their email communications in various contexts. The survey questionnaire consisted of carefully designed questions related to email usage patterns, organisational strategies, folder structures, and automation utilised for email management. The survey also explored participants' preferences for automated rule-based filtering functionality and any challenges they face in effectively managing their mailbox.
Researchers and professionals interested in email management and information organisation can leverage this dataset for research, analysis, and potential improvements in email client design and functionality.
We kindly request that any publications or research utilising this dataset appropriately acknowledge and cite the original source to ensure proper attribution to the survey and its participants.
The Enron email dataset contains approximately 500,000 emails generated by employees of the Enron Corporation. It was obtained by the Federal Energy Regulatory Commission during its investigation of Enron's collapse.
This is the May 7, 2015 version of the dataset, as published at https://www.cs.cmu.edu/~./enron/
Salutary Data is a boutique B2B contact and company data provider that's committed to delivering high-quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B contacts (US only), along with 4M+ companies, and is updated regularly to ensure we have the most up-to-date information.
We can enrich your in-house data (CRM Enrichment, Lead Enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end users.
What makes Salutary unique? - We offer our clients a truly unique, one-stop aggregation of the best-of-breed quality data sources. Our supplier network consists of numerous, established high quality suppliers that are rigorously vetted. - We leverage third party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts. - We're reasonably priced and easy to work with.
Products: API Suite Web UI Full and Custom Data Feeds
Services: Data Enrichment - We assess the fill rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net new "look alike" prospects for your campaigns. ABM Match & Append - Send us your domain or other company-related files, and we'll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally throw in your suppression file to avoid any redundant records. Verification ("Cleaning/Hygiene") Services - Address the 2% per month aging issue on contact records! We will identify duplicate records and contacts no longer at the company, remove your email hard bounces, and update/replace titles or phones. This is right up our alley and leverages our existing internal and external processes and systems.
Premium B2C Consumer Database - 269+ Million US Records
Supercharge your B2C marketing campaigns with our comprehensive consumer database, featuring over 269 million verified US consumer records. Our 20+ years of data expertise deliver higher quality and more extensive coverage than competitors.
Core Database Statistics
Consumer Records: Over 269 million
Email Addresses: Over 160 million (verified and deliverable)
Phone Numbers: Over 76 million (mobile and landline)
Mailing Addresses: Over 116,000,000 (NCOA processed)
Geographic Coverage: Complete US (all 50 states)
Compliance Status: CCPA compliant with consent management
Targeting Categories Available
Demographics: Age ranges, education levels, occupation types, household composition, marital status, presence of children, income brackets, and gender (where legally permitted)
Geographic: Nationwide, state-level, MSA (Metropolitan Statistical Area), zip code radius, city, county, and SCF range targeting options
Property & Dwelling: Home ownership status, estimated home value, years in residence, property type (single-family, condo, apartment), and dwelling characteristics
Financial Indicators: Income levels, investment activity, mortgage information, credit indicators, and wealth markers for premium audience targeting
Lifestyle & Interests: Purchase history, donation patterns, political preferences, health interests, recreational activities, and hobby-based targeting
Behavioral Data: Shopping preferences, brand affinities, online activity patterns, and purchase timing behaviors
Multi-Channel Campaign Applications
Deploy across all major marketing channels:
Email marketing and automation
Social media advertising
Search and display advertising (Google, YouTube)
Direct mail and print campaigns
Telemarketing and SMS campaigns
Programmatic advertising platforms
Data Quality & Sources
Our consumer data aggregates from multiple verified sources:
Public records and government databases
Opt-in subscription services and registrations
Purchase transaction data from retail partners
Survey participation and research studies
Online behavioral data (privacy compliant)
Technical Delivery Options
File Formats: CSV, Excel, JSON, XML formats available
Delivery Methods: Secure FTP, API integration, direct download
Processing: Real-time NCOA, email validation, phone verification
Custom Selections: 1,000+ selectable demographic and behavioral attributes
Minimum Orders: Flexible based on targeting complexity
Unique Value Propositions
Dual Spouse Targeting: Reach both household decision-makers for maximum impact
Cross-Platform Integration: Seamless deployment to major ad platforms
Real-Time Updates: Monthly data refreshes ensure maximum accuracy
Advanced Segmentation: Combine multiple targeting criteria for precision campaigns
Compliance Management: Built-in opt-out and suppression list management
Ideal Customer Profiles
E-commerce retailers seeking customer acquisition
Financial services companies targeting specific demographics
Healthcare organizations with compliant marketing needs
Automotive dealers and service providers
Home improvement and real estate professionals
Insurance companies and agents
Subscription services and SaaS providers
Performance Optimization Features
Lookalike Modeling: Create audiences similar to your best customers
Predictive Scoring: Identify high-value prospects using AI algorithms
Campaign Attribution: Track performance across multiple touchpoints
A/B Testing Support: Split audiences for campaign optimization
Suppression Management: Automatic opt-out and DNC compliance
Pricing & Volume Options
Flexible pricing structures accommodate businesses of all sizes:
Pay-per-record for small campaigns
Volume discounts for large deployments
Subscription models for ongoing campaigns
Custom enterprise pricing for high-volume users
Data Compliance & Privacy
VIA.tools maintains industry-leading compliance standards:
CCPA (California Consumer Privacy Act) compliant
CAN-SPAM Act adherence for email marketing
TCPA compliance for phone and SMS campaigns
Regular privacy audits and data governance reviews
Transparent opt-out and data deletion processes
Getting Started
Our data specialists work with you to:
Define your target audience criteria
Recommend optimal data selections
Provide sample data for testing
Configure delivery methods and formats
Implement ongoing campaign optimization
Why We Lead the Industry
With over two decades of data industry experience, we combine extensive database coverage with advanced targeting capabilities. Our commitment to data quality, compliance, and customer success has made us the preferred choice for businesses seeking superior B2C marketing performance.
Contact our team to discuss your specific targeting requirements and receive custom pricing for your marketing objectives.
EnronSR is a benchmark dataset based on the Enron email corpus that contains both naturally occurring human- and AI-generated email replies for the same set of messages. This resource enables the public benchmarking of novel language-generation models and facilitates a comparison against the strong, production-level baseline of Google Smart Reply used by millions of people.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Panza Emails dataset
This dataset contains collections of emails of three authentic users (david, isabel, and marcus), with personal information (names, places, etc.) replaced by other ones for donor privacy. Except for these changes, the language of the emails is genuine. The intention of this dataset is to allow researchers to study strategies for text personalization. The data was donated explicitly for this purpose. This dataset is ethically collected and fully licensed for… See the full description on the dataset page: https://huggingface.co/datasets/ISTA-DASLab/Panza-emails.
Email remains a vital communication tool for both personal and professional use. For those who have been using Time Warner Cable services, the Roadrunner email service is a familiar name. Now managed by Spectrum, the Roadrunner email platform is still active and accessible for users with existing accounts. However, to access all its features and ensure smooth communication, it's essential to understand how to set up, use, and manage your Roadrunner login account effectively.
What Is a Roadrunner Login Account? A Roadrunner login account is the email account created through Time Warner Cable's Roadrunner service, now handled by Spectrum. Although new Roadrunner accounts are no longer issued, existing users can continue to access their email using the credentials associated with their original account.
The Roadrunner login account functions like any other email service, allowing users to send, receive, organize, and store emails. It's especially popular among long-time customers who prefer the simplicity and reliability of the interface.
Setting Up a Roadrunner Login Account For users with an existing Roadrunner email address, setting up access on new devices or email clients is straightforward. While you cannot create a new Roadrunner login account, here's how to set up your existing account on various platforms:
On Web Browser Open your preferred browser.
Navigate to the Spectrum or legacy Roadrunner email portal.
Enter your Roadrunner email address and password.
Click "Sign In" to access your inbox.
On Email Clients (Outlook, Thunderbird, etc.) To configure your Roadrunner login account on email software, you need both incoming and outgoing server details:
Incoming Server (IMAP or POP3): Server: mail.twc.com Port: 993 (IMAP), 110 (POP3) Security: SSL/TLS
Outgoing Server (SMTP): Server: mail.twc.com Port: 587 Security: STARTTLS
Make sure to enter your full email address and password when setting up.
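For readers who script their mail access rather than use a desktop client, the settings above could be applied roughly as follows with Python's standard library. This is a sketch under assumptions: the host and port values are copied from the list above and have not been verified against the provider, and the helper names `open_imap` and `open_smtp` are hypothetical.

```python
import imaplib
import smtplib

# Values copied from the settings listed above (not independently verified).
INCOMING = {"host": "mail.twc.com", "imap_port": 993, "pop3_port": 110}
OUTGOING = {"host": "mail.twc.com", "port": 587}


def open_imap(address, password):
    """Open an IMAP connection over SSL/TLS and log in with the full address."""
    connection = imaplib.IMAP4_SSL(INCOMING["host"], INCOMING["imap_port"])
    connection.login(address, password)
    return connection


def open_smtp(address, password):
    """Open an SMTP connection and upgrade it with STARTTLS before login."""
    connection = smtplib.SMTP(OUTGOING["host"], OUTGOING["port"])
    connection.starttls()
    connection.login(address, password)
    return connection
```

Neither helper is called here, so nothing connects until you pass in your own full email address and password.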
Benefits of Using a Roadrunner Login Account While Roadrunner email may seem old-school to some, it still offers various features that benefit users:
Reliable Service Users report that their Roadrunner login account remains stable and reliable for both sending and receiving emails.
Simple Interface Unlike many modern, cluttered email interfaces, Roadrunner email is known for its clean and user-friendly layout.
Storage and Access Roadrunner provides decent storage limits and access across various devices including desktops, laptops, and mobile phones.
Spam Filtering The spam detection system for Roadrunner login accounts helps keep your inbox clean and secure.
Troubleshooting Roadrunner Login Issues If you're having trouble accessing your Roadrunner login account, you're not alone. Below are some of the most common issues and how to fix them:
Forgot Password If you forget your Roadrunner password, visit the Spectrum account recovery page. You'll need to verify your identity and then reset your password.
Incorrect Credentials Double-check the spelling of your email address and password. Also, make sure Caps Lock isn't turned on, which can cause login errors.
Locked Account Too many failed login attempts may result in your Roadrunner login account being temporarily locked. Waiting a few minutes or resetting the password usually resolves this.
Server Settings If your email client isn't working, make sure you're using the correct IMAP/POP and SMTP settings as listed above.
Managing Your Roadrunner Login Account Properly managing your Roadrunner login account ensures it stays secure and functional over time. Here are a few tips:
Update Recovery Options Make sure your account has a valid recovery email or phone number, so you can regain access if needed.
Regular Password Changes For security purposes, it's advisable to change your password every few months.
Organize Emails Use folders and filters to keep your inbox organized. This will help you manage important messages more effectively.
Delete Unnecessary Emails Clearing old or unwanted messages can help you stay within storage limits and improve overall account performance.
Keeping Your Roadrunner Login Account Secure With cybersecurity threats on the rise, protecting your Roadrunner login account is more important than ever:
Use a strong and unique password combining letters, numbers, and symbols.
Avoid using public Wi-Fi to access your email unless you're using a VPN.
Enable two-step authentication if available through Spectrum.
Never click suspicious links or download attachments from unknown senders.
Accessing Roadrunner Email on Mobile Devices To use your Roadrunner login account on a smartphone or tablet:
Go to your device's email app and add a new account.
Choose "Other" or "Manual Setup" if prompted.
Enter your Roadrunner email address and password.
Input the server settings manually as previously mentioned.
Save and sync.
Once configured, you can send and receive emails from your mobile device just like you would from a computer.
Final Thoughts Though it may not be as modern as Gmail or Outlook, the Roadrunner login account continues to serve many long-time users with reliability and simplicity. Whether you're checking email on your desktop or syncing it with your mobile device, understanding how to manage and secure your Roadrunner account is key to staying connected and protected.
The Measurable AI Amazon Consumer Transaction Dataset is a leading source of email receipts and consumer transaction data, offering data collected directly from users via Proprietary Consumer Apps, with millions of opt-in users.
We source our email receipt consumer data panel via two consumer apps which garner the express consent of our end-users (GDPR compliant). We then aggregate and anonymize all the transactional data to produce raw and aggregate datasets for our clients.
Use Cases Our clients leverage our datasets to produce actionable consumer insights such as: - Market share analysis - User behavioral traits (e.g. retention rates) - Average order values - Promotional strategies used by the key players. Several of our clients also use our datasets for forecasting and understanding industry trends better.
Coverage - Asia (Japan) - EMEA (Spain, United Arab Emirates) - Continental Europe - USA
Granular Data Itemized, high-definition data per transaction level with metrics such as - Order value - Items ordered - No. of orders per user - Delivery fee - Service fee - Promotions used - Geolocation data and more
Aggregate Data - Weekly/monthly order volume - Revenue delivered in aggregate form, with historical data dating back to 2018. All the transactional e-receipts are sent from the app to users' registered accounts.
Most of our clients are fast-growing Tech Companies, Financial Institutions, Buyside Firms, Market Research Agencies, Consultancies and Academia.
Our dataset is GDPR compliant, contains no PII information and is aggregated & anonymized with user consent. Contact business@measurable.ai for a data dictionary and to find out our volume in each country.
List of the data tables as part of the Immigration System Statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending March 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending March 2025 (MS Excel Spreadsheet, 66.5 KB): https://assets.publishing.service.gov.uk/media/68258d71aa3556876875ec80/passenger-arrivals-summary-mar-2025-tables.xlsx
"Passengers refused entry at the border summary tables" and "Passengers refused entry at the border detailed datasets" have been discontinued. The latest published versions of these tables are from February 2025 and are available in the "Passenger refusals – release discontinued" section. A similar data series, "Refused entry at port and subsequently departed", is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 56.7 KB): https://assets.publishing.service.gov.uk/media/681e406753add7d476d8187f/electronic-travel-authorisation-datasets-mar-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending March 2025 (MS Excel Spreadsheet, 113 KB): https://assets.publishing.service.gov.uk/media/68247953b296b83ad5262ed7/visas-summary-mar-2025-tables.xlsx
Entry clearance visa applications and outcomes detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 29.1 MB): https://assets.publishing.service.gov.uk/media/682c4241010c5c28d1c7e820/entry-clearance-visa-outcomes-datasets-mar-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional dat
Transform Your Business with Our Comprehensive B2B Marketing Data
Our B2B Marketing Data is designed to be a cornerstone for data-driven professionals looking to optimize their business strategies. With an unwavering commitment to data integrity and quality, our dataset empowers you to make informed decisions, enhance your outreach efforts, and drive business growth.
Why Choose Our B2B Marketing Data?
Unmatched Data Integrity and Quality
Our data is meticulously sourced and validated through rigorous processes to ensure its accuracy, relevance, and reliability. This commitment to excellence guarantees that you are equipped with the most up-to-date information, empowering your business to thrive in a competitive landscape.
Versatile and Strategic Applications
This versatile dataset caters to a wide range of business needs, including:
Lead Generation: Identify and connect with potential clients who align with your business goals.
Market Segmentation: Tailor your marketing efforts by segmenting your audience based on industry, company size, or geographical location.
Personalized Marketing Campaigns: Craft personalized outreach strategies that resonate with your target audience, increasing engagement and conversion rates.
B2B Communication Strategies: Enhance your communication efforts with direct access to decision-makers, ensuring your message reaches the right people.
Comprehensive Data Attributes
Our B2B Marketing Data offers more than just basic contact information. With more than 20 attributes, you gain in-depth insights into:
Decision-Maker Roles: Understand the responsibilities and influence of key figures within an organization, such as CEOs, executives, and other senior management.
Industry Affiliations: Analyze industry-specific data to tailor your approach to the unique dynamics of each sector.
Contact Information: Direct email addresses and phone numbers streamline communication, enabling you to engage with your audience effectively and efficiently.
Expansive Global Coverage
Our dataset spans a wide array of countries, providing a truly global perspective for your business initiatives. Whether you're looking to expand into new markets or strengthen your presence in existing ones, our data ensures comprehensive coverage across the following regions:
North America: United States, Canada, Mexico
Europe: United Kingdom, Germany, France, Italy, Spain, Netherlands, Sweden, and more
Asia: China, Japan, India, South Korea, Singapore, Malaysia, and more
South America: Brazil, Argentina, Chile, Colombia, and more
Africa: South Africa, Nigeria, Kenya, Egypt, and more
Australia and Oceania: Australia, New Zealand
Middle East: United Arab Emirates, Saudi Arabia, Israel, Qatar, and more
Industry-Wide Reach
Our B2B Marketing Data covers an extensive range of industries, ensuring that no matter your focus, you have access to the insights you need:
Finance and Banking, Technology, Healthcare, Manufacturing, Retail, Education, Energy, Real Estate, Telecommunications, Hospitality, Transportation and Logistics, Government and Public Sector, Non-Profit Organizations, and many more
Comprehensive Employee and Revenue Size Information
Our dataset includes detailed records on company size and revenue, covering:
Employee Size: From small businesses with a handful of employees to large multinational corporations, we provide data across all scales.
Revenue Size: Analyze companies based on their revenue brackets, allowing for precise market segmentation and targeted marketing efforts.
Seamless Integration with Broader Data Offerings
Our B2B Marketing Data is not just a standalone product; it integrates seamlessly with our broader suite of premium datasets. This integration enables you to create a holistic and customized approach to your data-driven initiatives, ensuring that every aspect of your business strategy is informed by the most accurate and comprehensive data available.
Elevate Your Business with Data-Driven Precision
Optimize your marketing strategies with our high-quality, reliable, and scalable B2B Marketing Data. Identify new opportunities, understand market dynamics, and connect with key decision-makers to drive your business forward. With our dataset, you'll stay ahead of the competition and foster meaningful business relationships that lead to sustained growth.
Unlock the full potential of your business with our B2B Marketing Data, the ultimate resource for growth, reliability, and scalability.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same number is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Due to the data sources (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hide all the sensitive/personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from the Gab social network. The authors of the dataset (Zannettou et al., 2018) found that "Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls." They also found that hate speech is much more prevalent there than on Twitter, but less prevalent than on 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for the test split and the remaining up to 1000 for the train split, if available) for each combination of the selected 22 languages (detected using a combination of automated approaches) and platform. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters or on their length, resulting in about 58k human-written texts.
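The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sample_and_split`, the fixed seed, and the rule that the first 300 sampled texts go to the test split are assumptions; only the per-group limits (1300 total, 300 test, up to 1000 train) come from the description.

```python
import random
from collections import defaultdict


def sample_and_split(records, per_group=1300, test_size=300, seed=0):
    """Sample up to `per_group` texts per (language, source) pair and tag a split.

    `records` is a list of dicts with at least "language" and "source" keys.
    Hypothetical helper: the original sampling code is not part of this listing.
    """
    rng = random.Random(seed)  # fixed seed keeps the pseudo-random choice reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["language"], rec["source"])].append(rec)

    sampled = []
    for key in sorted(groups):  # sorted keys give a deterministic group order
        items = groups[key]
        chosen = rng.sample(items, min(per_group, len(items)))
        for i, rec in enumerate(chosen):
            # Assumption: the first `test_size` sampled texts form the test split,
            # everything after that (up to 1000) goes to train.
            split = "test" if i < test_size else "train"
            sampled.append({**rec, "split": split})
    return sampled
```

With this rule, a small group (fewer than 300 texts) ends up entirely in the test split, which matches the "if available" wording for the train portion.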
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
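As a rough illustration of how these fields can be combined, the sketch below filters a few toy rows by split, language, and noise flag, and counts samples per generator. The rows and their values (including the `source` and `multi_label` strings) are invented for the example, and the helper names `select` and `generator_counts` are hypothetical; only the field names come from the list above.

```python
from collections import Counter

# Toy rows using the documented field names; all values are made up for illustration.
rows = [
    {"text": "good morning everyone", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 3,
     "source": "twitter", "potential_noise": 0},
    {"text": "morning greetings to all", "label": 1, "multi_label": "gpt-3.5-turbo-0125",
     "split": "train", "language": "en", "length": 4,
     "source": "twitter", "potential_noise": 0},
    {"text": "dobre rano ###", "label": 0, "multi_label": "human",
     "split": "test", "language": "sk", "length": 3,
     "source": "telegram", "potential_noise": 1},
]


def select(rows, split, language=None, drop_noisy=True):
    """Return rows of one split, optionally limited to a language and to noise-free texts."""
    return [
        r for r in rows
        if r["split"] == split
        and (language is None or r["language"] == language)
        and not (drop_noisy and r["potential_noise"] == 1)
    ]


def generator_counts(rows):
    """Count samples per 'multi_label' value (a model name or "human")."""
    return Counter(r["multi_label"] for r in rows)
```

For example, `select(rows, "train", language="en")` keeps the two English training rows, while `select(rows, "test")` is empty because the only test row is flagged as potentially noisy.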
ToDo Statistics (under construction)
Forager.ai's Small Business Contact Data set is a comprehensive collection of over 695M professional profiles. With an unmatched 2x/month refresh rate, we ensure the most current and dynamic data in the industry today. We deliver this data via JSONL flat-files or PostgreSQL database delivery, capturing publicly available information on each profile.
Volume and Stats
Every single record refreshed 2x per month, setting industry standards. First-party data curation powering some of the most renowned sales and recruitment platforms. Delivery frequency is hourly (fastest in the industry today). Additional datapoints and linkages available. Delivery formats: JSONL, PostgreSQL, CSV.
Datapoints
150+ unique datapoints available! Key fields like Current Title, Current Company, Work History, Educational Background, Location, Address, and more. Unique linkage data to other social networks or contact data available.
Use Cases
Sales Platforms, ABM Vendors, Intent Data Companies, AdTech and more:
Deliver the best end-customer experience with our people feed powering your solution! Be the first to know when someone changes jobs and share that with end-customers. Industry-leading data accuracy. Connect our professional records to your existing database, find new connections to other social networks, and contact data. Hashed records also available for advertising use-cases.
Venture Capital and Private Equity:
Track every company and employee with a publicly available profile. Keep track of your portfolio's founders, employees and ex-employees, and be the first to know when they move or start up. Keep an eye on the pulse by following the most influential people in the industries and segments you care about. Provide your portfolio companies with the best data for recruitment and talent sourcing. Review departmental headcount growth of private companies and benchmark their strength against competitors.
HR Tech, ATS Platforms, Recruitment Solutions, as well as Executive Search Agencies:
Build products for industry-specific and industry-agnostic candidate recruiting platforms. Track person job changes and immediately refresh profiles to avoid stale data. Identify ideal candidates through work experience and education history. Keep ATS systems and candidate profiles constantly updated. Link data from this dataset into GitHub, LinkedIn, and other social networks.
Delivery Options
Flat files via S3 or GCP; PostgreSQL Shared Database; PostgreSQL Managed Database; REST API; other options available at request, depending on scale required.
Other key features
Over 120M US Professional Profiles. 150+ Data Fields (available upon request). Free data samples and evaluation.
Tags: Professionals Data, People Data, Work Experience History, Education Data, Employee Data, Workforce Intelligence, Identity Resolution, Talent, Candidate Database, Sales Database, Contact Data, Account Based Marketing, Intent Data.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Get the count of unique users that connected to Exchange Online using any email app.
Gain exclusive access to verified Shopify store owners with our premium Shopify Users Email List. This database includes essential data fields such as Store Name, Website, Contact Name, Email Address, Phone Number, Physical Address, Revenue Size, Employee Size, and more on demand. Leverage real-time, accurate data to enhance your marketing efforts and connect with high-value Shopify merchants. Whether you're targeting small businesses or enterprise-level Shopify stores, our database ensures precision and reliability for optimized lead generation and outreach strategies. Key Highlights: 3.9M+ Shopify Stores; Direct Contact Info of Shopify Store Owners; 40+ Data Points; Lifetime Access; 10+ Data Segmentations; FREE Sample Data.
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
Dataset Statistics
# Nodes | %Fraud Nodes (Class=1) |
---|---|
11,944 | 9.5 |
Relation | # Edges |
---|---|
U-P-U | |
U-S-U | |
U-V-U | 1,036,737 |
All |
Graph Construction
The Amazon dataset includes product reviews under the Musical Instruments category. Similar to this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. We conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. We take users as nodes in the graph and design three relations: 1) U-P-U: it connects users reviewing at least one same product; 2) U-S-U: it connects users having at least one same star rating within one week; 3) U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users.
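The labeling rule and the U-P-U relation can be sketched as below. This is a minimal illustration under stated assumptions, not the dataset authors' code: the function names `label_user` and `upu_edges`, the `(user, product)` input format, and the choice to leave users between the two thresholds unlabeled (`None`) are all hypothetical. The 80% and 20% thresholds and the Class=1 convention for fraud follow the description and the statistics table.

```python
from collections import defaultdict
from itertools import combinations


def label_user(helpful_votes, total_votes):
    """Return 0 (benign), 1 (fraudulent), or None (unlabeled) from the helpful-vote ratio."""
    if total_votes == 0:
        return None  # assumption: users without votes cannot be labeled
    ratio = helpful_votes / total_votes
    if ratio > 0.8:
        return 0  # more than 80% helpful votes: benign
    if ratio < 0.2:
        return 1  # less than 20% helpful votes: fraudulent (Class=1)
    return None  # in between: left unlabeled in this sketch


def upu_edges(reviews):
    """Build undirected U-P-U edges between users who reviewed a common product.

    `reviews` is an iterable of (user, product) pairs (hypothetical input format).
    Each edge is a sorted (u, v) tuple, so a pair of users appears at most once.
    """
    users_by_product = defaultdict(set)
    for user, product in reviews:
        users_by_product[product].add(user)

    edges = set()
    for users in users_by_product.values():
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges
```

The U-S-U and U-V-U relations would follow the same pattern, grouping users by a shared star rating within a one-week window or by thresholded TF-IDF similarity instead of by product.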
To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Swedish Telecom interactions. This diversity ensures the dataset accurately represents the language used by Swedish speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Swedish Telecom interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
To quote the data source: "This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in som