28 datasets found
  1. Spam_email_Dataset

    • kaggle.com
    zip
    Updated Aug 22, 2023
    Cite
    Nandita Pore (2023). Spam_email_Dataset [Dataset]. https://www.kaggle.com/datasets/nanditapore/spam-email-dataset
    Explore at:
    Available download formats: zip (310235 bytes)
    Dataset updated
    Aug 22, 2023
    Authors
    Nandita Pore
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains synthetic data designed for practicing spam email classification. The dataset includes various features extracted from email messages, such as the email's content, sender and recipient information, as well as metadata like date and time of sending, attachment count, link count, and more.

    Columns

    • Email: The email address of the sender.
    • Subject: The subject line of the email.
    • Sender: The email address of the sender.
    • Recipient: The email address of the recipient.
    • Date: The date when the email was sent.
    • Time (24 hours format): The time of day when the email was sent (in 24-hour format).
    • Attachments: The number of attachments present in the email.
    • Link Count: The number of hyperlinks present in the email.
    • Word Count: The total number of words in the email.
    • Uppercase Count: The count of words in uppercase letters.
    • Exclamation Count: The count of exclamation marks in the email.
    • Question Count: The count of question marks in the email.
    • Dollar Count: The count of dollar signs in the email.
    • Punctuation Count: The count of various punctuation marks (e.g., commas, periods).
    • HTML Tags Count: The count of HTML tags in the email.
    • Spam Indicator: A binary label indicating whether the email is spam (1) or not (0).
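
    As an illustration of what the count-style columns above measure, the following is a minimal sketch that derives similar counts from a raw email body. The function name `extract_features`, the key names, and the regular expressions are assumptions made for this example; the dataset itself holds Faker-generated random values, so this is not how its numbers were produced.

```python
import re
import string


def extract_features(body: str) -> dict:
    """Compute count-style features similar to the columns listed above."""
    words = body.split()
    return {
        # Rough link detection: anything starting with http:// or https://
        "link_count": len(re.findall(r"https?://\S+", body)),
        "word_count": len(words),
        # A word counts as uppercase if all of its letters are capitals.
        "uppercase_count": sum(1 for w in words if w.isupper()),
        "exclamation_count": body.count("!"),
        "question_count": body.count("?"),
        "dollar_count": body.count("$"),
        "punctuation_count": sum(1 for ch in body if ch in string.punctuation),
        # Rough tag detection: any <...> span is treated as one HTML tag.
        "html_tags_count": len(re.findall(r"<[^>]+>", body)),
    }


sample = "WIN $100 NOW! Visit http://example.com <b>today</b>?"
features = extract_features(sample)
print(features)
```

    Whitespace splitting and regex-based tag matching are deliberately crude; a real pipeline would likely use a proper tokenizer and HTML parser.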

    Usage

    This dataset is intended for practicing and experimenting with binary classification tasks, specifically spam email classification. Participants can explore the relationships between different features and the spam indicator to build and evaluate machine learning models for detecting spam emails. Please note that this dataset contains synthetic data generated for educational purposes.

    Note

    The data in this dataset is synthetic and generated using the Faker library, with random values for demonstration purposes. It does not accurately represent real email content or spam characteristics. Therefore, it's recommended to use this dataset for learning and practicing classification techniques rather than for developing production-level models.

    Acknowledgments

    This dataset was created for educational purposes and is inspired by real-world email data. It was generated using the Faker library and is released under the Creative Commons License.

  2. Spam Emails by Country 2023

    • kaggle.com
    zip
    Updated Mar 10, 2023
    Cite
    derina (2023). Spam Emails by Country 2023 [Dataset]. https://www.kaggle.com/datasets/derina/spam-emails-by-contry-2023
    Explore at:
    Available download formats: zip (342 bytes)
    Dataset updated
    Mar 10, 2023
    Authors
    derina
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The United States ranked first in the world for sending the most spam emails in a single day as of January 16, 2023, with about eight billion. Czechia and the Netherlands followed closely with 7.7 billion and 7.6 billion spam emails, respectively.

    Global trends in internet and email usage

    The number of email users worldwide grew from 3.9 billion in 2019 to 4.1 billion in 2021 and is projected to reach 4.6 billion by 2025. However, email usage varies across countries. For instance, China and India had the largest internet populations as of July 2021, with over 979 million and 845 million users each, but they used email less frequently than users in the United States or Germany.

    Email as the top online activity in the U.S.

    Email was not only the most common source of spam messages globally as of October 2021, but also the most popular online activity among U.S. internet users in 2019. In fact, email users accounted for 90.9 percent of respondents, surpassing search users, social network users, and digital video viewers.

    Data by Cisco Talos

  3. Daily Mail Summarization Dataset

    • kaggle.com
    zip
    Updated Aug 6, 2024
    Cite
    Evil Spirit05 (2024). Daily Mail Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/evilspirit05/daily-mail-summarization-dataset
    Explore at:
    Available download formats: zip (52096 bytes)
    Dataset updated
    Aug 6, 2024
    Authors
    Evil Spirit05
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description
    The "Daily Mail Articles and Highlights" dataset comprises a meticulously curated collection of 8,176 articles, along with their corresponding highlights, sourced directly from the Daily Mail website. This extensive dataset is designed to facilitate the development and training of sophisticated text summarization models that can generate concise and accurate summaries for long-form articles.
    

    Objective

    The primary goal of this dataset is to train a text summarization model capable of producing brief, yet informative, summaries of given articles. This endeavor is particularly beneficial for readers who seek to grasp the essential points of lengthy articles quickly, thereby enhancing their reading efficiency and comprehension.
    

    Data Collection Process

    The dataset was compiled through an automated web scraping process, ensuring the inclusion of a diverse range of articles spanning various topics and categories. Each article in the dataset is paired with its highlight, which serves as a reference summary. The highlights are succinct extracts that encapsulate the core message of the articles, providing a foundation for training summarization models.
    

    https://www.dailymail.co.uk/home/index.html

    Technical Framework

    To achieve the goal of creating an efficient summarization system, we employ a combination of cutting-edge technologies and libraries, including:
    
    • Hugging Face's Transformers: A powerful library that provides pre-trained models and tools for natural language processing tasks. For this project, we leverage the DistilBERT model, known for its efficiency and performance in text summarization tasks.
    • Blurr: A library that bridges the gap between Hugging Face’s Transformers and Fastai, enabling seamless integration and enhanced model training capabilities.
    • Fastai: An accessible deep learning library that simplifies the process of building and training models. Fastai's user-friendly interface and robust functionalities are instrumental in developing and fine-tuning the summarization model.

    Implementation Strategy

    The summarization model is trained using the collected dataset, following a structured workflow:
    
    • Preprocessing: The articles and highlights are cleaned and preprocessed to ensure consistency and quality. This step includes tokenization, normalization, and handling of special characters.
    • Model Training: Utilizing the DistilBERT model from Hugging Face's Transformers, the training process involves fine-tuning the model on the preprocessed dataset. The integration of Blurr and Fastai facilitates efficient training and model optimization.
    • Evaluation and Tuning: The model's performance is evaluated using various metrics, such as ROUGE scores, to assess the quality of the generated summaries. Continuous tuning and iteration are performed to enhance the model’s accuracy and reliability.
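
    The evaluation step above mentions ROUGE scores. As a rough illustration only, the sketch below computes a simplified ROUGE-1 (unigram overlap) between a generated summary and a reference highlight. The function name `rouge1` and the whitespace tokenization are assumptions for this example; it is not the official ROUGE implementation and not the authors' evaluation code.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

    Published ROUGE results usually add stemming and also report ROUGE-2 and ROUGE-L, so numbers from this sketch are not directly comparable to them.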

    Applications

    The resulting summarization system is designed to automatically produce concise and informative summaries, which can be used in various applications, including:
    
    • News Aggregation Platforms: Providing readers with quick summaries of news articles, enhancing their ability to stay informed with minimal time investment.
    • Educational Tools: Assisting students and researchers by summarizing lengthy academic articles and papers.
    • Content Management Systems: Enabling efficient content curation and management by generating summaries for large volumes of articles.

    Conclusion

    The "Daily Mail Articles and Highlights" dataset is a valuable resource for advancing the field of text summarization. By leveraging state-of-the-art techniques and libraries, this project aims to develop a robust summarization model that can significantly improve the way we consume and process information. This dataset not only supports the creation of efficient summarization systems but also contributes to the broader goal of making information more accessible and digestible for all.
    
  4. Click-through rates of marketing e-mails worldwide 2023, by country

    • statista.com
    Updated Dec 10, 2024
    Cite
    J. G. Navarro (2024). Click-through rates of marketing e-mails worldwide 2023, by country [Dataset]. https://www.statista.com/topics/1446/e-mail-marketing/
    Explore at:
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    J. G. Navarro
    Description

    In 2023, marketing e-mails in Canada had a click-through rate of 8.68 percent, highest among the selected countries presented in the data set. In Germany, the rate stood at 2.37 percent.

  5. cnn_dailymail

    • huggingface.co
    Updated Aug 28, 2023
    + more versions
    Cite
    Abigail See (2023). cnn_dailymail [Dataset]. https://huggingface.co/datasets/abisee/cnn_dailymail
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Authors
    Abigail See
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for CNN Dailymail Dataset

      Dataset Summary
    

    The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

      Supported Tasks and Leaderboards
    

    'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.

  6. The total number of mailboxes and number of active mailboxes every day |...

    • gimi9.com
    + more versions
    Cite
    The total number of mailboxes and number of active mailboxes every day | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-opendata-umea-se-api-v2-catalog-datasets-getmailboxusagemailboxcounts0/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The total number of user mailboxes in Umeå kommun and how many are active each day of the reporting period. A mailbox is considered active if the user sent or read any email.

  7. The total number of mailboxes and number of active mailboxes every day

    • opendataumea.aws-ec2-eu-central-1.opendatasoft.com
    • opendata.umea.se
    • +1 more
    csv, excel, json
    Updated Dec 1, 2025
    Cite
    (2025). The total number of mailboxes and number of active mailboxes every day [Dataset]. https://opendataumea.aws-ec2-eu-central-1.opendatasoft.com/explore/dataset/getmailboxusagemailboxcounts0/api/?flg=en-gb
    Explore at:
    Available download formats: json, csv, excel
    Dataset updated
    Dec 1, 2025
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The total number of user mailboxes in Umeå kommun and how many are active each day of the reporting period. A mailbox is considered active if the user sent or read any email.

  8. cnn_dailymail

    • huggingface.co
    • tensorflow.org
    • +1 more
    + more versions
    Cite
    ccdv, cnn_dailymail [Dataset]. https://huggingface.co/datasets/ccdv/cnn_dailymail
    Explore at:
    Authors
    ccdv
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    CNN/DailyMail non-anonymized summarization dataset.

    There are two features:
    • article: text of news article, used as the document to be summarized
    • highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
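
    A minimal sketch of recovering individual highlights from the joined target string, assuming each highlight is wrapped in `<s>` and `</s>` markers. The function name `split_highlights` and the sample string are hypothetical; check the actual field contents before relying on this.

```python
import re


def split_highlights(highlights: str) -> list[str]:
    """Return the individual highlights found between <s> and </s> markers."""
    # Non-greedy match so each <s>...</s> pair yields one highlight.
    parts = re.findall(r"<s>(.*?)</s>", highlights, flags=re.DOTALL)
    return [part.strip() for part in parts]


joined = "<s> First point . </s> <s> Second point . </s>"
items = split_highlights(joined)
print(items)
```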

  9. Email CTR Prediction

    • kaggle.com
    zip
    Updated Nov 15, 2022
    Cite
    Sk4467 (2022). Email CTR Prediction [Dataset]. https://www.kaggle.com/datasets/sk4467/email-ctr-prediction
    Explore at:
    Available download formats: zip (59241 bytes)
    Dataset updated
    Nov 15, 2022
    Authors
    Sk4467
    Description

    Most organizations today rely on email campaigns for effective communication with users. Email communication is one of the popular ways to pitch products to users and build trustworthy relationships with them. Email campaigns contain different types of CTA (Call To Action). The ultimate goal of email campaigns is to maximize the Click Through Rate (CTR), defined as CTR = (number of users who clicked on at least one CTA) / (number of emails delivered). This dataset contains details of body length, sub length, mean paragraph, day of week, is weekend, etc.
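
    The CTR definition above can be sketched as a small function. The name `click_through_rate`, its parameter names, and the sample counts are assumptions for illustration and are not taken from the dataset.

```python
def click_through_rate(users_clicked: int, emails_delivered: int) -> float:
    """CTR = users who clicked at least one CTA / emails delivered."""
    if emails_delivered <= 0:
        raise ValueError("emails_delivered must be positive")
    return users_clicked / emails_delivered


# Hypothetical campaign: 50 recipients clicked out of 1000 delivered emails.
ctr = click_through_rate(users_clicked=50, emails_delivered=1000)
print(f"CTR: {ctr:.2%}")
```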

  10. Medallion Drivers - Active | gimi9.com

    • gimi9.com
    Updated Apr 2, 2020
    + more versions
    Cite
    (2020). Medallion Drivers - Active | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_medallion-drivers-active
    Explore at:
    Dataset updated
    Apr 2, 2020
    Description

    PLEASE NOTE: This dataset, which includes all TLC Licensed Drivers who are in good standing and able to drive, is updated every day in the evening between 4-7pm. Please check the 'Last Update Date' field to make sure the list has updated successfully. 'Last Update Date' should show either today or yesterday's date, depending on the time of day. If the list is outdated, please download the most recent list from the link below.

    http://www1.nyc.gov/assets/tlc/downloads/datasets/tlc_medallion_drivers_active.csv

    This is a list of drivers with a current TLC Driver License, which authorizes drivers to operate NYC TLC licensed yellow and green taxicabs and for-hire vehicles (FHVs). This list is accurate as of the date and time shown in the Last Date Updated and Last Time Updated fields. Questions about the contents of this dataset can be sent by email to: licensinginquiries@tlc.nyc.gov.

  11. Alesco 30 Day New Mover (New Homeowner) Data US based emails, phones and...

    • datarade.ai
    .csv, .xls, .txt
    Updated Jul 31, 2023
    Cite
    Alesco Data (2023). Alesco 30 Day New Mover (New Homeowner) Data US based emails, phones and addresses of people who have moved in the last 30 days. Licensing Available! [Dataset]. https://datarade.ai/data-products/alesco-30-day-new-mover-database-us-based-with-emails-and-p-alesco-data
    Explore at:
    Available download formats: .csv, .xls, .txt
    Dataset updated
    Jul 31, 2023
    Dataset authored and provided by
    Alesco Data
    Area covered
    United States of America
    Description

    Harness the Power of Fresh New Homeowner Audience Data

    Our comprehensive New Homeowner Audience Data file is a meticulously curated compilation of Direct Marketing data, enriched with valuable Email Address Data. This essential resource offers unparalleled access to Consumers and Prospects who have recently moved into new homes or apartments.

    Averaging an impressive 1.1 million records monthly, our dataset is continually updated with the latest information, including a dedicated 30-day hotline file for the most recent movers. This ensures you're always working with the freshest and most relevant data.

    With an average income surpassing $55K and a high concentration of families, these new homeowners present a prime opportunity for businesses across various sectors. From healthcare providers and home improvement specialists to financial advisors and interior designers, our data empowers you to identify and reach your ideal customer.

    Benefit from our flexible pricing options, allowing you to tailor your data acquisition to your specific business needs. Choose from transactional purchases or opt for annual licensing with unlimited use cases for marketing and analytics.

    Unlock the full potential of your marketing campaigns with our New Homeowner Audience Data.

  12. 2025 Municipal Primary Election Mail Ballot Requests Department of State

    • data.pa.gov
    Updated Oct 8, 2025
    Cite
    Department of State (2025). 2025 Municipal Primary Election Mail Ballot Requests Department of State [Dataset]. https://data.pa.gov/Government-Efficiency-Citizen-Engagement/2025-Municipal-Primary-Election-Mail-Ballot-Reques/ih4x-yb7a
    Explore at:
    Available download formats: kmz, csv, application/geo+json, xlsx, xml, kml
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    United States Department of State (http://state.gov/)
    Authors
    Department of State
    License

    https://www.usa.gov/government-works

    Description

    FINAL UPDATE 09/29/2025. Final update was contingent upon counties completing data reconciliation. This dataset describes the current state of mail ballot requests for the 2025 Municipal Primary Election. It’s a snapshot in time of the current volume of ballot requests across the Commonwealth. The file contains all mail ballot requests except ballot applications that are declined as duplicate.

    This point-in-time transactional data is being published for informational purposes to provide detailed data pertaining to the processing of absentee and mail-in ballots by county election offices. This data is extracted once per day from the Statewide Uniform Registry of Electors (SURE system), and it reflects activity recorded by the counties in the SURE system at the time of the data extraction.

    Please note that county election offices will continue to process ballot applications (as applicable), record ballots, reconcile ballot data, and make corrections when necessary, and this will continue through, and even after, Election Day. Administrative practices for recording transactions in the system will vary by county. For example, some counties record individual transactions as they occur, while others record transactions in batches at specific intervals. These activities may result in substantial changes to a county's reported data from one day to the next. County practices also differ on when cancelled ballot data is entered into the database (i.e., before or after the election). Some counties do not enter cancelled ballot data entirely.

    Additional notes specific to this dataset:
    • Counties can enter cancellation codes without entering a ballot returned date.
    • Some cancellation codes are a result of administrative processes, meaning the ballot was never mailed to the voter before it was cancelled (e.g., there was an error when the label was printed).
    • Confidential and protected voters are not included in this file.
    • Counties can only enter one cancel code per ballot, even if there are multiple errors. Different counties may vary in what code they choose to use when this arises, or they may choose to use the catch-all category of 'CANC - OTHER'.
    • Counties may use ‘PEND’ codes as part of their notice and cure practice. These are usually converted to ‘CANC’ codes after the election. However, in situations where PEND codes remain after the election, these should be considered cancelled.
    • Columns and data codes included in this file have evolved over time. For example, for past elections (e.g., 2020), cancelled ballots were not included in the file. This may make it difficult to compare data from election to election.
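
    The note on ‘PEND’ codes above can be sketched as a small normalization step. The function name `effective_status`, the returned labels, and the sample code 'PEND - EXAMPLE' are hypothetical; only 'CANC - OTHER' and the PEND/CANC prefixes come from the description.

```python
def effective_status(code: str, election_over: bool) -> str:
    """Map a raw ballot status code to a simplified status label."""
    normalized = code.strip().upper()
    if normalized.startswith("CANC"):
        return "cancelled"
    if normalized.startswith("PEND"):
        # PEND codes that remain after the election are treated as cancelled.
        return "cancelled" if election_over else "pending"
    return "not cancelled"


print(effective_status("CANC - OTHER", election_over=False))
print(effective_status("PEND - EXAMPLE", election_over=True))
```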

    Type of data included in this file: This data includes all mail ballot applications processed by counties, which includes voters on the permanent mail-in and absentee ballot lists. Multiple rows in this data may correspond to the same voter if they submitted more than one application or had a cancelled ballot(s). A deidentified voter ID has been provided to allow data users to identify when rows correspond to the same voter. This ID is randomized and cannot be used to match to SURE, the Full Voter Export, or previous iterations of the Statewide Mail Ballot File.
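
    A minimal sketch of using the deidentified voter ID to find voters who appear on more than one row. The field names `voter_id` and `app_type` and the sample rows are hypothetical; the actual column names in the file may differ.

```python
from collections import Counter

# Hypothetical rows; real files would be read with csv.DictReader or similar.
rows = [
    {"voter_id": "A1", "app_type": "M"},
    {"voter_id": "B2", "app_type": "CVO"},
    {"voter_id": "A1", "app_type": "M"},
]

# Count rows per deidentified voter ID.
rows_per_voter = Counter(row["voter_id"] for row in rows)

# Keep only voters that appear on more than one row.
repeated = {vid: n for vid, n in rows_per_voter.items() if n > 1}
print(repeated)
```

    Counting distinct IDs rather than rows avoids overstating the number of voters when one person has several applications or a cancelled ballot.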

    All application types in this file are considered a type of mail ballot. Some of the applications are considered UOCAVA (Uniformed and Overseas Citizens Absentee Voting Act) or UMOVA (Uniform Military and Overseas Voters Act) ballots. These are listed below:

    • CRI - Civilian - Remote/Isolated
    • CVO - Civilian Overseas
    • F - Federal (Unregistered)
    • M - Military
    • MRI - Military - Remote/Isolated
    • V - Veteran
    • BV - Bedridden Veteran
    • BVRI - Bedridden Veteran - Remote/Isolated

    *We may not have all application types in the file for every election.

  13. Registered Business Locations San Francisco from DataSF pulled daily objects...

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • hub.arcgis.com
    Updated Oct 1, 2025
    + more versions
    Cite
    City and County of San Francisco (2025). Registered Business Locations San Francisco from DataSF pulled daily objects [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/datasets/sfgov::registered-business-locations-san-francisco-from-datasf-pulled-daily?layer=1
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    City and County of San Francisco
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Area covered
    San Francisco
    Description

    NEW!: Use the new Business Account Number lookup tool.

    SUMMARY

    This dataset includes the locations of businesses that pay taxes to the City and County of San Francisco. Each registered business may have multiple locations and each location is a single row. The Treasurer & Tax Collector’s Office collects this data through business registration applications, account update/closure forms, and taxpayer filings. Business locations marked as “Administratively Closed” have not filed or communicated with TTX for 3 years, or were marked as closed following a notification from another City and County Department. The data is collected to help enforce the Business and Tax Regulations Code including, but not limited to: Article 6, Article 12, Article 12-A, and Article 12-A-1. http://sftreasurer.org/registration.

    HOW TO USE THIS DATASET

    System migration in 2014: When the City transitioned to a new system in 2014, only active business accounts were migrated. As a result, any businesses that had already closed by that point were not included in the current dataset.

    2018 account cleanup: In 2018, TTX did a major cleanup of dormant and unresponsive accounts and closed approximately 40,000 inactive businesses.

    To learn more about using this dataset watch this video. To update your listing or look up your BAN see this FAQ: Registered Business Locations Explainer

    Data pushed to ArcGIS Online on November 10, 2025 at 6:16 AM by SFGIS.
    Data from: https://data.sfgov.org/d/g8m3-pdis

    Description of dataset columns:

     UniqueID
     Unique formula: @Value(ttxid)-@Value(certificate_number)
    
    
     Business Account Number
     Seven digit number assigned to registered business accounts
    
    
     Location Id
     Location identifier
    
    
     Ownership Name
     Business owner(s) name
    
    
     DBA Name
     Doing Business As Name or Location Name
    
    
     Street Address
     Business location street address
    
    
     City
     Business location city
    
    
     State
     Business location state
    
    
     Source Zipcode
     Business location zip code
    
    
     Business Start Date
     Start date of the business
    
    
     Business End Date
     End date of the business
    
    
     Location Start Date
     Start date at the location
    
    
     Location End Date
     End date at the location, if closed
    
    
     Administratively Closed
     Business locations marked as “Administratively Closed” have not filed or communicated with TTX for 3 years, or were marked as closed following a notification from another City and County Department.
    
    
     Mail Address
     Address for mailing
    
    
     Mail City
     Mailing address city
    
    
    
     Mail State
     Mailing address state
    
    
    
     Mail Zipcode
     Mailing address zipcode
    
    
     NAICS Code
     The North American Industry Classification System (NAICS) is a standard used by Federal statistical agencies for the purpose of collecting, analyzing and publishing statistical data related to the U.S. business economy. A subset of these are options on the business registration form used in the administration of the City and County's tax code. The registrant indicates the business activity on the City and County's tax registration forms.
    

    See NAICS Codes tab in the attached data dictionary under About > Attachments.

     NAICS Code Description
     The Business Activity that the NAICS code maps on to ("Multiple" if there are multiple codes indicated for the business).
    
    
     NAICS Code Descriptions List
     A list of all NAICS code descriptions separated by semi-colon
    
    
     LIC Code
     The LIC code of the business, if multiple, separated by spaces
    
    
     LIC Code Description
     The LIC code description ("Multiple" if there are multiple codes for a business)
    
    
     LIC Code Descriptions List
     A list of all LIC code descriptions separated by semi-colon
    
    
     Parking Tax
     Whether or not this business pays the parking tax
    
    
     Transient Occupancy Tax
     Whether or not this business pays the transient occupancy tax
    
    
     Business Location
     The latitude and longitude of the business location for mapping purposes.
    
    
     Business Corridor
     The Business Corridor in which the business location falls, if it is in one. Not all business locations are in a corridor.
    

    Boundary reference: https://data.sfgov.org/d/h7xa-2xwk

     Neighborhoods - Analysis Boundaries
     The Analysis Neighborhood in which the business location falls. Not applicable outside of San Francisco.
    

    Boundary reference: https://data.sfgov.org/d/p5b7-5n3h

     Supervisor District
     The Supervisor District in which the business location falls. Not applicable outside of San Francisco. Boundary reference: https://data.sfgov.org/d/xz9b-wyfc
    
    
     Community Benefit District
     The Community Benefit District in which the business location falls. Not applicable outside of San Francisco. Boundary reference: https://data.sfgov.org/d/c28a-f6gs
    
    
     data_as_of
     Timestamp the data was updated in the source system
    
    
     data_loaded_at
     Timestamp the data was loaded here (open data portal)
    
    
     SF Find Neighborhoods
     This column was automatically created in order to record in what polygon from the dataset 'SF Find Neighborhoods' (6qbp-sg9q) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
    
    
     Current Police Districts
     This column was automatically created in order to record in what polygon from the dataset 'Current Police Districts' (qgnn-b9vv) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
    
    
     Current Supervisor Districts
     This column was automatically created in order to record in what polygon from the dataset 'Current Supervisor Districts' (26cr-cadq) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
    
    
     Analysis Neighborhoods
     This column was automatically created in order to record in what polygon from the dataset 'Analysis Neighborhoods' (ajp5-b2md) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
    
    
     Neighborhoods
     This column was automatically created in order to record in what polygon from the dataset 'Neighborhoods' (jwn9-ihcz) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
    

    Note: If no description was provided by DataSF, the cell is left blank. See the source data for more information.

  14. Newspapers-Indian Daily Mail-1946 to 1947

    • data.gov.sg
    Updated Jun 6, 2024
    Cite
    National Library Board (2024). Newspapers-Indian Daily Mail-1946 to 1947 [Dataset]. https://data.gov.sg/datasets/d_434d294555cbb371da63e9770d5b4ca1/view
    Explore at:
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    National Library Board (http://www.nlb.gov.sg/)
    License

    https://data.gov.sg/open-data-licence

    Time period covered
    Feb 2024 - Feb 2025
    Area covered
    India
    Description

    Dataset from National Library Board. For more information, visit https://data.gov.sg/datasets/d_434d294555cbb371da63e9770d5b4ca1/view

  15. custom_summarization_dataset

    • huggingface.co
    Updated Sep 16, 2024
    + more versions
    Cite
    Junseong Park (2024). custom_summarization_dataset [Dataset]. https://huggingface.co/datasets/rasauq1122/custom_summarization_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 16, 2024
    Authors
    Junseong Park
    Description

    Dataset Card for Custom Text Dataset

      Dataset Name
    

    Custom Text Dataset

      Overview
    

    This dataset contains text data for training summarization models. The data is collected from CNN/daily mail.

      Composition
    

    Number of records: 100 Fields: text, label

      Collection Process
    

    CNN/daily mail

      Preprocessing
    

    nothing

      How to Use
    

    from datasets import load_dataset
    dataset = load_dataset("path_to_dataset")

    for example in… See the full description on the dataset page: https://huggingface.co/datasets/rasauq1122/custom_summarization_dataset.

  16. ⛳️ Golf Play Dataset Extended

    • kaggle.com
    zip
    Updated Nov 3, 2023
    Cite
    Samy Baladram (2023). ⛳️ Golf Play Dataset Extended [Dataset]. https://www.kaggle.com/datasets/samybaladram/golf-play-extended/code
    Explore at:
    Available download formats: zip (223298 bytes)
    Dataset updated
    Nov 3, 2023
    Authors
    Samy Baladram
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description


    Overview

    This Extended Golf Play Dataset is a rich and detailed collection designed to extend the classic golf dataset. It includes a variety of features to cover many aspects of data science. This dataset is especially useful for teaching because it offers many small datasets within it, each one created for a different learning purpose.

    Core Features:

    • Outlook: Type of weather (sunny, cloudy, rainy, snowy).
    • Temperature: How hot or cold it is, in Celsius.
    • Humidity: How much moisture is in the air, as a percent.
    • Windy: If it is windy or not (True or False).
    • Play: If golf was played or not (Yes or No).

    Extra Features:

    • ID: Each player's unique number.
    • Date: The day the data was recorded.
    • Weekday: What day of the week it is.
    • Holiday: If the day is a special holiday (Yes or No).
    • Season: Time of the year (spring, summer, autumn, winter).
    • Crowded-ness: How crowded the golf course is.
    • PlayTime-Hour: How long people played golf, in hours.

    Text Features:

    • Review: What players said about their day at golf.
    • EmailCampaign: Emails the golf place sent every day.
    • MaintenanceTasks: Work done to take care of the golf course.

    Mini Datasets Collection

    This dataset includes a special set of mini datasets:

    • Each mini dataset focuses on a specific teaching point, like how to clean data or how to combine datasets.
    • They're perfect for beginners to practice with real examples.
    • Along with these datasets, you'll find notebooks with step-by-step guides that show you how to use the data.

    Learning With This Dataset

    Students can use this dataset to learn many skills:

    • Seeing Data: Learn how to make graphs and see patterns.
    • Sorting Data: Find out which data helps to predict if golf will be played.
    • Finding Odd Data: Spot data that doesn't look right.
    • Understanding Data Over Time: Look at how things change day by day or month by month.
    • Grouping Data: Learn how to put similar days together.
    • Learning From Text: Use players' reviews to get more insights.
    • Making Recommendations: Suggest the best time to play golf based on past data.
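
    One possible way to approach the "Sorting Data" item is to rank features by information gain with respect to Play. The sketch below is a minimal pure-Python illustration, not the dataset author's method. The function names are hypothetical and the six rows are toy values invented for the example, not taken from the dataset; only the column names come from the Core Features list.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target="Play"):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Toy rows invented for illustration (assumption: not real dataset values).
rows = [
    {"Outlook": "sunny", "Windy": False, "Play": "Yes"},
    {"Outlook": "sunny", "Windy": True, "Play": "Yes"},
    {"Outlook": "rainy", "Windy": False, "Play": "No"},
    {"Outlook": "rainy", "Windy": True, "Play": "No"},
    {"Outlook": "cloudy", "Windy": False, "Play": "Yes"},
    {"Outlook": "cloudy", "Windy": True, "Play": "No"},
]

# Score each candidate feature and sort from most to least informative.
gains = {f: information_gain(rows, f) for f in ("Outlook", "Windy")}
ranked = sorted(gains, key=gains.get, reverse=True)
```

    On these toy rows Outlook separates the Play labels better than Windy, so it ranks first. Results on the real data may differ.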

    Who Can Use This Dataset

    This dataset is for everyone:

    • New Learners: It's easy to understand and has guides to help you learn.
    • Teachers: Great for classes on how to see and understand data.
    • Researchers: Good for testing new ways to analyze data.

    Disclaimer

    This dataset can be shared and used by anyone under the Creative Commons Attribution 4.0 International License (CC BY 4.0). (Illustrations are AI-generated).


  17. US Stock Market Giants: Top Companies Stocks Data

    • kaggle.com
    zip
    Updated Nov 8, 2024
    Cite
    Azhar Saleem (2024). US Stock Market Giants: Top Companies Stocks Data [Dataset]. https://www.kaggle.com/datasets/azharsaleem/us-stock-market-giants-top-companies-stocks-data
    Explore at:
    Available download formats: zip (4730245 bytes)
    Dataset updated
    Nov 8, 2024
    Authors
    Azhar Saleem
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Stock Data of Top USA Companies: Apple, Tesla, Amazon

    👨‍💻 Author: Azhar Saleem

    Dataset Description

    This dataset provides daily stock data for some of the top companies in the USA stock market, including major players like Apple, Microsoft, Amazon, Tesla, and others. The data is collected from Yahoo Finance, covering each company’s historical data from its starting date until today. This comprehensive dataset enables in-depth analysis of key financial indicators and stock trends for each company, making it valuable for multiple applications.

    Column Descriptions

    The dataset contains the following columns, consistent across all companies:

    • Date: The date of the stock data entry.
    • Open: The stock's opening price for the day.
    • High: The highest price reached during the trading day.
    • Low: The lowest price during the trading day.
    • Close: The stock’s closing price for the day.
    • Volume: The total number of shares traded on that day.
    • Dividends: Any dividends paid out on that day.
    • Stock Splits: Records stock split events, if any, on that day.

    Potential Use Cases

    1. Machine Learning & Deep Learning:

      • Stock Price Prediction: Use historical prices to train models for forecasting future stock prices.
      • Sentiment Analysis and Price Correlation: Combine with external sentiment data to predict price movements based on market sentiment.
      • Anomaly Detection: Detect unusual price patterns or volume spikes using classification algorithms.
    2. Data Science:

      • Trend Analysis: Identify long-term trends for each company or compare trends between companies.
      • Volatility Analysis: Calculate volatility to assess risk and return patterns over time.
      • Correlation Analysis: Compare stock performance across companies to study market relationships.
    3. Data Analysis:

      • Historical Performance: Review historical data to understand growth trends, market impact of stock splits, and dividends.
      • Seasonal Patterns: Analyze data for seasonal trends or recurring patterns across years.
      • Investment Strategy Backtesting: Test various investment strategies based on historical data to assess potential profitability.
    4. Financial Research:

      • Economic Impact Studies: Investigate how major events affected stock prices across top companies.
      • Sector-Specific Analysis: Identify performance differences across sectors, such as tech, healthcare, and retail.

    This dataset is a powerful tool for analysts, researchers, and financial enthusiasts, offering versatility across multiple domains from stock analysis to algorithmic trading models.
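
    As a rough illustration of the "Volatility Analysis" use case, the sketch below computes daily returns and a rolling volatility from closing prices using only the Python standard library. The function names, the window size, and the 252-trading-day annualization factor are assumptions chosen for the example, and the prices are made-up values; real input would come from the Close column.

```python
import math
import statistics


def daily_returns(closes):
    """Simple day-over-day returns from a list of closing prices."""
    return [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]


def rolling_volatility(returns, window):
    """Sample standard deviation of returns over each trailing window."""
    return [
        statistics.stdev(returns[i - window:i])
        for i in range(window, len(returns) + 1)
    ]


def annualize(daily_vol, trading_days=252):
    """Scale a daily volatility to a yearly figure (252 days is a common convention)."""
    return daily_vol * math.sqrt(trading_days)


# Made-up closing prices for illustration only (assumption: not real market data).
closes = [100.0, 102.0, 101.0, 103.0, 106.0, 104.0]
returns = daily_returns(closes)
vols = rolling_volatility(returns, window=3)
annualized = [annualize(v) for v in vols]
```

    Six prices give five returns and, with a window of 3, three volatility values. A longer window smooths the series at the cost of reacting more slowly to changes.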

  18. Student Attendance - Texas Schools (2020-2021)

    • kaggle.com
    zip
    Updated May 10, 2021
    Cite
    Christian Ortiz (2021). Student Attendance - Texas Schools (2020-2021) [Dataset]. https://www.kaggle.com/chrisiortiz/school-attendance-in-texas-covid-weather-ses
    Explore at:
    Available download formats: zip (1622770 bytes)
    Dataset updated
    May 10, 2021
    Authors
    Christian Ortiz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Texas
    Description

    ALL FILES ARE LOCATED AT MY REPOSITORY: https://github.com/christianio123/TexasAttendance

    Context

    I was curious about factors affecting school attendance so I gathered data from school districts around Texas to have a better idea.

    Topic/Purpose

    The purpose of the project is to help determine factors associated with student attendance in the state of Texas. No particular population is targeted as an audience; however, anyone working in education may find the dataset used (and other data obtained but not used) helpful for questions about student attendance in Texas during the first two months of the 2020-2021 academic school year. This topic was chosen specifically because of the abnormalities in the current academic school year.

    Data

    Daily Student Attendance

    The majority of the data in this project was collected from school districts around the state of Texas, public census information, and public COVID-19 data. To obtain student attendance information, an email was sent to 40 school districts around the state of Texas on November 2nd, 2020, using the Freedom of Information Act (FOIA). Of those districts, 19 responded with the requested data, while other districts required purchase of the data because of the labor hours involved. Because the original message sent to districts was ambiguous, varying types of data were collected. The major difference in the data received was between “daily” records of student attendance and a “summary” of student attendance records so far this academic school year. School districts took between 10 and 15 business days to respond, not including holidays. The focus of this project is “daily student attendance”, in order to find relationships with, or influences from, external or internal factors on any given school day. Therefore, of the 19 school districts that responded, 11 sent the appropriate data.
    The 11 school districts that sent data were (1) Conroe ISD, (2) Cypress-Fairbanks ISD, (3) Floydada ISD, (4) Fort Worth ISD, (5) Pasadena ISD, (6) Snook ISD, (7) Socorro ISD, (8) Klein ISD, (9) Garland ISD, (10) Dallas ISD, and (11) Katy ISD. However, even within these datasets there were discrepancies: three school districts sent daily attendance data including student grade level, but one school district did not include any other information. Also, of the 11 school districts, nine included student attendance broken down by school, while three other school districts only had student attendance with no other attributes. This information is important for explaining certain steps in analysis preparation later. Variables used from the school district datasets included (a) dates, (b) weekdays, (c) school name, (d) school type, (e) district, and (f) grade level.
    

    School Information, County Description, Metropolitan vs. Non-Metropolitan

      In addition to daily student attendance data, two other datasets from the Texas Education Agency were used, with data about each school and school district. The first dataset, “Current Schools”, gives information about each school in the state of Texas, such as address, principal, county name, district number and much more, as of May 2020. From this dataset, the variables selected include (a) school name, (b) school zip, (c) district number, and (d) school type. The second dataset, “District Type”, gives attributes of each school district, such as whether the district is considered major urban, independent town, or a rural area. From the “District Type” dataset, the variables used were (a) district, (b) district number, (c) Texas Education Agency (TEA) description, and (d) National Center of Education Statistics (NCES). To determine whether a county is metropolitan or non-metropolitan, a dataset from the Texas Health and Human Services was used. Selected variables from this dataset include (a) county name and (b) metro area.
    

    Other Factors: COVID-19

     Student attendance has been noticeably different this academic school year; therefore, live COVID-19 data was obtained from the New York Times to examine any relationship. This dataset is updated daily, with data available at three levels (country, state, and county). From this dataset, the variables selected were COVID-19 cases by state and by county.
    

    Other Factors: Demographics

    Each school has a unique student population; therefore, census data from 2018 (with the best estimate of today’s current population) was used to find the makeup of the population surrounding a school by zip code. From the census data, the variables selected were zip code, race/ethnicity, median income, unemployment rate, and education. These variables were selected to determine differences in school attendance based on the makeup of the population surrounding the school.
    

    Other Factors: Weather

      Weather seems to have an impact on student attendance at schools, so weather data has been included based on county measures.
    
  19. 🌎 Location Intelligence Data | From Google Map

    • kaggle.com
    zip
    Updated Apr 21, 2024
    Cite
    Azhar Saleem (2024). 🌎 Location Intelligence Data | From Google Map [Dataset]. https://www.kaggle.com/datasets/azharsaleem/location-intelligence-data-from-google-map
    Explore at:
    Available download formats: zip (1911275 bytes)
    Dataset updated
    Apr 21, 2024
    Authors
    Azhar Saleem
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    👨‍💻 Author: Azhar Saleem

    Dataset Overview

    Welcome to the Google Places Comprehensive Business Dataset! This dataset has been meticulously scraped from Google Maps and presents extensive information about businesses across several countries. Each entry in the dataset provides detailed insights into business operations, location specifics, customer interactions, and much more, making it an invaluable resource for data analysts and scientists looking to explore business trends, geographic data analysis, or consumer behaviour patterns.

    Key Features

    • Business Details: Includes unique identifiers, names, and contact information.
    • Geolocation Data: Precise latitude and longitude for pinpointing business locations on a map.
    • Operational Timings: Detailed opening and closing hours for each day of the week, allowing analysis of business activity patterns.
    • Customer Engagement: Data on review counts and ratings, offering insights into customer satisfaction and business popularity.
    • Additional Attributes: Links to business websites, time zone information, and country-specific details enrich the dataset for comprehensive analysis.

    Potential Use Cases

    This dataset is ideal for a variety of analytical projects, including:

    • Market Analysis: Understand business distribution and popularity across different regions.
    • Customer Sentiment Analysis: Explore relationships between customer ratings and business characteristics.
    • Temporal Trend Analysis: Analyze patterns of business activity throughout the week.
    • Geospatial Analysis: Integrate with mapping software to visualise business distribution or cluster businesses based on location.
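
    For the geospatial use case, a common first step is measuring great-circle distance between coordinate pairs. The sketch below uses the haversine formula with only the standard library. The function names and the three sample businesses are hypothetical, invented for illustration; only the name, latitude, and longitude field names are taken from the dataset's column list.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within_radius(businesses, center_lat, center_lon, radius_km):
    """Names of businesses whose coordinates fall within radius_km of the center."""
    return [
        b["name"]
        for b in businesses
        if haversine_km(center_lat, center_lon, b["latitude"], b["longitude"]) <= radius_km
    ]


# Hypothetical records for illustration (assumption: not rows from the dataset).
businesses = [
    {"name": "A", "latitude": 0.0, "longitude": 0.0},
    {"name": "B", "latitude": 0.0, "longitude": 0.5},
    {"name": "C", "latitude": 0.0, "longitude": 2.0},
]

nearby = within_radius(businesses, 0.0, 0.0, radius_km=100.0)
```

    The haversine formula treats the Earth as a sphere, which is usually accurate enough for clustering or radius filters. Survey-grade work would need an ellipsoidal model instead.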

    Dataset Structure

    The dataset contains 46 columns, providing a thorough profile for each listed business. Key columns include:

    • business_id: A unique Google Places identifier for each business, ensuring distinct entries.
    • phone_number: The contact number associated with the business. It provides a direct means of communication.
    • name: The official name of the business as listed on Google Maps.
    • full_address: The complete postal address of the business, including locality and geographic details.
    • latitude: The geographic latitude coordinate of the business location, useful for mapping and spatial analysis.
    • longitude: The geographic longitude coordinate of the business location.
    • review_count: The total number of reviews the business has received on Google Maps.
    • rating: The average user rating out of 5 for the business, reflecting customer satisfaction.
    • timezone: The world timezone the business is located in, important for temporal analysis.
    • website: The official website URL of the business, providing further information and contact options.
    • category: The category or type of service the business provides, such as restaurant, museum, etc.
    • claim_status: Indicates whether the business listing has been claimed by the owner on Google Maps.
    • plus_code: A sho...
  20. OpenAI Summarization Corpus

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). OpenAI Summarization Corpus [Dataset]. https://www.kaggle.com/datasets/thedevastator/openai-summarization-corpus/code
    Explore at:
    Available download formats: zip (35399096 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI Summarization Corpus

    Training and Validation Data from TL;DR, CNN, and Daily Mail

    By Huggingface Hub [source]

    About this dataset

    This dataset provides a corpus for natural language processing tasks, specifically text summarization and the validation of reward models from OpenAI. Its columns contain summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with additional information: the choices made by workers when summarizing the text, batch information that distinguishes the summaries created by different workers, and the dataset split. With this data, users can train natural language processing systems on real-world text to produce concise summaries of long-form text, and can compare model output directly against human-generated results.

    How to use the dataset

    This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.

    To use this dataset for summarization tasks:

    • Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
    • Choose which summary you want from the choice column of either .csv file based on your preference for worker or batch type summarization.
    • Review entries in the selected summary's corresponding summaries columns for alternative options with similar content but different word choices/styles that you prefer over the original choice worker or batch entry.
    • Look through split, worker, batch information for more information regarding each choice before selecting one to use as your desired summary according to its accuracy or clarity with regards to its content.

    Research Ideas

    • Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
    • Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
    • Analyzing the worker and batch information, in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: comparisons_validation.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split | Split of the dataset between training and validation sets. (String) |
    | extra | Additional information about the given source material available. (String) |

    File: comparisons_train.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split ...

Cite
Nandita Pore (2023). Spam_email_Dataset [Dataset]. https://www.kaggle.com/datasets/nanditapore/spam-email-dataset

Spam_email_Dataset

Uncover Patterns in Email Data to Detect Spam

Explore at:
Available download formats: zip (310235 bytes)
Dataset updated
Aug 22, 2023
Authors
Nandita Pore
License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically

Description

This dataset contains synthetic data designed for practicing spam email classification. The dataset includes various features extracted from email messages, such as the email's content, sender and recipient information, as well as metadata like date and time of sending, attachment count, link count, and more.

Columns

  • Email: The email address of the sender.
  • Subject: The subject line of the email.
  • Sender: The email address of the sender.
  • Recipient: The email address of the recipient.
  • Date: The date when the email was sent.
  • Time (24 hours format): The time of day when the email was sent (in 24-hour format).
  • Attachments: The number of attachments present in the email.
  • Link Count: The number of hyperlinks present in the email.
  • Word Count: The total number of words in the email.
  • Uppercase Count: The count of words in uppercase letters.
  • Exclamation Count: The count of exclamation marks in the email.
  • Question Count: The count of question marks in the email.
  • Dollar Count: The count of dollar signs in the email.
  • Punctuation Count: The count of various punctuation marks (e.g., commas, periods).
  • HTML Tags Count: The count of HTML tags in the email.
  • Spam Indicator: A binary label indicating whether the email is spam (1) or not (0).

Usage

This dataset is intended for practicing and experimenting with binary classification tasks, specifically spam email classification. Participants can explore the relationships between different features and the spam indicator to build and evaluate machine learning models for detecting spam emails. Please note that this dataset contains synthetic data generated for educational purposes.
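
When evaluating a spam classifier, it helps to compare it against a majority-class baseline. The sketch below is a minimal standard-library illustration of that comparison. The function names, the link-count threshold rule, and the seven rows are all hypothetical and invented for the example; only the "Link Count" and "Spam Indicator" column names come from the Columns list.

```python
from collections import Counter


def majority_baseline(labels):
    """Most common label; predicting it for every row gives the baseline."""
    return Counter(labels).most_common(1)[0][0]


def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (spam = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def link_rule(row, threshold=3):
    """Hypothetical rule: flag as spam when Link Count exceeds the threshold."""
    return 1 if row["Link Count"] > threshold else 0


# Toy rows invented for illustration (assumption: not values from the dataset).
rows = [
    {"Link Count": 5, "Spam Indicator": 1},
    {"Link Count": 1, "Spam Indicator": 0},
    {"Link Count": 4, "Spam Indicator": 0},
    {"Link Count": 0, "Spam Indicator": 0},
    {"Link Count": 2, "Spam Indicator": 1},
    {"Link Count": 7, "Spam Indicator": 1},
    {"Link Count": 1, "Spam Indicator": 0},
]

y_true = [r["Spam Indicator"] for r in rows]
rule_metrics = evaluate(y_true, [link_rule(r) for r in rows])
baseline_label = majority_baseline(y_true)
baseline_metrics = evaluate(y_true, [baseline_label] * len(y_true))
```

A model is only useful if it beats the baseline on held-out data. Because the values in this dataset are randomly generated, a rule like this one may well fail to do so on the real file.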

Note

The data in this dataset is synthetic and generated using the Faker library, with random values for demonstration purposes. It does not accurately represent real email content or spam characteristics. Therefore, it's recommended to use this dataset for learning and practicing classification techniques rather than for developing production-level models.

Acknowledgments

This dataset was created for educational purposes and is inspired by real-world email data. It was generated using the Faker library and is released under the Creative Commons License.
