Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset contains synthetic data designed for practicing spam email classification. The dataset includes various features extracted from email messages, such as the email's content, sender and recipient information, as well as metadata like date and time of sending, attachment count, link count, and more.
This dataset is intended for practicing and experimenting with binary classification tasks, specifically spam email classification. Participants can explore the relationships between different features and the spam indicator to build and evaluate machine learning models for detecting spam emails. Please note that this dataset contains synthetic data generated for educational purposes.
The data in this dataset is synthetic and generated using the Faker library, with random values for demonstration purposes. It does not accurately represent real email content or spam characteristics. Therefore, it's recommended to use this dataset for learning and practicing classification techniques rather than for developing production-level models.
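Because the entry above describes a straightforward binary classification exercise, a minimal sketch follows. The file name and column names ("message_content", "is_spam") are assumptions; substitute the actual schema of the CSV.

```python
# Minimal sketch: a bag-of-words spam classifier on the synthetic data.
# File name and column names are assumptions, not part of the dataset documentation.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("synthetic_spam_emails.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["message_content"], df["is_spam"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(max_features=5000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```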
This dataset was created for educational purposes and is inspired by real-world email data. It was generated using the Faker library and is released under the Creative Commons License.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The United States ranked first in the world for sending the most spam emails in a single day as of January 16, 2023, with about eight billion. Czechia and the Netherlands followed closely with 7.7 billion and 7.6 billion spam emails, respectively.
Global trends in internet and email usage
The number of email users worldwide grew from 3.9 billion in 2019 to 4.1 billion in 2021 and is projected to reach 4.6 billion by 2025. However, email usage varies across countries. For instance, China and India had the largest internet populations as of July 2021, with over 979 million and 845 million users respectively, but they used email less frequently than users in the United States or Germany.

Email as the top online activity in the U.S.
Email was not only the most common source of spam messages globally as of October 2021, but also the most popular online activity among U.S. internet users in 2019. In fact, email users accounted for 90.9 percent of respondents, surpassing search users, social network users, and digital video viewers.
Data by Cisco Talos
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The "Daily Mail Articles and Highlights" dataset comprises a meticulously curated collection of 8,176 articles, along with their corresponding highlights, sourced directly from the Daily Mail website. This extensive dataset is designed to facilitate the development and training of sophisticated text summarization models that can generate concise and accurate summaries for long-form articles.
The primary goal of this dataset is to train a text summarization model capable of producing brief, yet informative, summaries of given articles. This endeavor is particularly beneficial for readers who seek to grasp the essential points of lengthy articles quickly, thereby enhancing their reading efficiency and comprehension.
The dataset was compiled through an automated web scraping process, ensuring the inclusion of a diverse range of articles spanning various topics and categories. Each article in the dataset is paired with its highlight, which serves as a reference summary. The highlights are succinct extracts that encapsulate the core message of the articles, providing a foundation for training summarization models.
To achieve the goal of creating an efficient summarization system, we employ a combination of cutting-edge technologies and libraries, including:
The summarization model is trained using the collected dataset, following a structured workflow:
The resulting summarization system is designed to automatically produce concise and informative summaries, which can be used in various applications, including:
The "Daily Mail Articles and Highlights" dataset is a valuable resource for advancing the field of text summarization. By leveraging state-of-the-art techniques and libraries, this project aims to develop a robust summarization model that can significantly improve the way we consume and process information. This dataset not only supports the creation of efficient summarization systems but also contributes to the broader goal of making information more accessible and digestible for all.
In 2023, marketing e-mails in Canada had a click-through rate of 8.68 percent, the highest among the selected countries presented in the data set. In Germany, the rate stood at 2.37 percent.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for CNN Dailymail Dataset
Dataset Summary
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
Supported Tasks and Leaderboards
'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.
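For context, a minimal loading sketch using the dataset page linked above; the "3.0.0" configuration is the one commonly used for summarization, so verify against the dataset card.

```python
# Minimal sketch: load CNN/DailyMail from the Hugging Face Hub and inspect one record.
from datasets import load_dataset

ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
print(ds["train"][0]["article"][:300])   # source document
print(ds["train"][0]["highlights"])      # target summary
```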
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The total number of user mailboxes in Umeå kommun and how many are active each day of the reporting period. A mailbox is considered active if the user sent or read any email.
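A minimal sketch of computing the daily share of active mailboxes from such an extract; the file and column names are assumptions, not documented fields.

```python
# Sketch: daily share of active mailboxes. Column names are assumptions.
import pandas as pd

df = pd.read_csv("mailbox_usage.csv", parse_dates=["report_date"])
df["active_share"] = df["active_mailboxes"] / df["total_mailboxes"]
print(df[["report_date", "active_share"]].head())
```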
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
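A minimal sketch of loading this release through TensorFlow Datasets, which distributes it under the name used above; verify the split names against the TFDS catalog.

```python
# Sketch: load the non-anonymized CNN/DailyMail dataset via TensorFlow Datasets.
import tensorflow_datasets as tfds

ds = tfds.load("cnn_dailymail", split="train")
for example in ds.take(1):
    print(example["article"].numpy()[:300])   # document to summarize
    print(example["highlights"].numpy())      # target summary
```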
Most organizations today rely on email campaigns for effective communication with users. Email communication is one of the popular ways to pitch products to users and build trustworthy relationships with them. Email campaigns contain different types of CTA (Call To Action). The ultimate goal of email campaigns is to maximize the Click Through Rate (CTR): CTR = number of users who clicked on at least one CTA / number of emails delivered. This dataset contains details such as body length, subject length, mean paragraph length, day of week, an is-weekend flag, and more.
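As a small illustration of the CTR definition above (a sketch, not tied to the dataset's actual column names):

```python
# CTR = users who clicked at least one CTA / emails delivered.
def click_through_rate(users_clicked_any_cta: int, emails_delivered: int) -> float:
    return users_clicked_any_cta / emails_delivered

print(click_through_rate(87, 1000))  # 0.087, i.e. 8.7%
```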
PLEASE NOTE: This dataset, which includes all TLC Licensed Drivers who are in good standing and able to drive, is updated every day in the evening between 4-7pm. Please check the 'Last Update Date' field to make sure the list has updated successfully. 'Last Update Date' should show either today's or yesterday's date, depending on the time of day. If the list is outdated, please download the most recent list from the link below. http://www1.nyc.gov/assets/tlc/downloads/datasets/tlc_medallion_drivers_active.csv

This is a list of drivers with a current TLC Driver License, which authorizes drivers to operate NYC TLC licensed yellow and green taxicabs and for-hire vehicles (FHVs). This list is accurate as of the date and time shown in the Last Date Updated and Last Time Updated fields. Questions about the contents of this dataset can be sent by email to: licensinginquiries@tlc.nyc.gov.
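A minimal sketch of pulling the current list from the URL given above, assuming the link is reachable and serves a standard CSV.

```python
# Sketch: download the current active-driver list directly from the published URL.
import pandas as pd

url = "http://www1.nyc.gov/assets/tlc/downloads/datasets/tlc_medallion_drivers_active.csv"
drivers = pd.read_csv(url)
print(len(drivers), "active TLC-licensed drivers")
print(drivers.head())
```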
Harness the Power of Fresh New Homeowner Audience Data
Our comprehensive New Homeowner Audience Data file is a meticulously curated compilation of Direct Marketing data, enriched with valuable Email Address Data. This essential resource offers unparalleled access to Consumers and Prospects who have recently moved into new homes or apartments.
Averaging an impressive 1.1 million records monthly, our dataset is continually updated with the latest information, including a dedicated 30-day hotline file for the most recent movers. This ensures you're always working with the freshest and most relevant data.
With an average income surpassing $55K and a high concentration of families, these new homeowners present a prime opportunity for businesses across various sectors. From healthcare providers and home improvement specialists to financial advisors and interior designers, our data empowers you to identify and reach your ideal customer.
Benefit from our flexible pricing options, allowing you to tailor your data acquisition to your specific business needs. Choose from transactional purchases or opt for annual licensing with unlimited use cases for marketing and analytics.
Unlock the full potential of your marketing campaigns with our New Homeowner Audience Data.
https://www.usa.gov/government-works
FINAL UPDATE 09/29/2025. Final update was contingent upon counties completing data reconciliation. This dataset describes the current state of mail ballot requests for the 2025 Municipal Primary Election. It’s a snapshot in time of the current volume of ballot requests across the Commonwealth. The file contains all mail ballot requests except ballot applications that are declined as duplicate.
This point-in-time transactional data is being published for informational purposes to provide detailed data pertaining to the processing of absentee and mail-in ballots by county election offices. This data is extracted once per day from the Statewide Uniform Registry of Electors (SURE system), and it reflects activity recorded by the counties in the SURE system at the time of the data extraction.
Please note that county election offices will continue to process ballot applications (as applicable), record ballots, reconcile ballot data, and make corrections when necessary, and this will continue through, and even after, Election Day. Administrative practices for recording transactions in the system will vary by county. For example, some counties record individual transactions as they occur, while others record transactions in batches at specific intervals. These activities may result in substantial changes to a county's reported data from one day to the next. County practices also differ on when cancelled ballot data is entered into the database (i.e., before or after the election). Some counties do not enter cancelled ballot data entirely.
Additional notes specific to this dataset:
• Counties can enter cancellation codes without entering a ballot returned date.
• Some cancellation codes are a result of administrative processes, meaning the ballot was never mailed to the voter before it was cancelled (e.g., there was an error when the label was printed).
• Confidential and protected voters are not included in this file.
• Counties can only enter one cancel code per ballot, even if there are multiple errors. Different counties may vary in what code they choose to use when this arises, or they may choose to use the catch-all category of 'CANC - OTHER'.
• Counties may use ‘PEND’ codes as part of their notice and cure practice. These are usually converted to ‘CANC’ codes after the election. However, in situations where PEND codes remain after the election, these should be considered cancelled (see the sketch after the application-type list below).
• Columns and data codes included in this file have evolved over time. For example, for past elections (e.g., 2020), cancelled ballots were not included in the file. This may make it difficult to compare data from election to election.
Type of data included in this file: This data includes all mail ballot applications processed by counties, which includes voters on the permanent mail-in and absentee ballot lists. Multiple rows in this data may correspond to the same voter if they submitted more than one application or had a cancelled ballot(s). A deidentified voter ID has been provided to allow data users to identify when rows correspond to the same voter. This ID is randomized and cannot be used to match to SURE, the Full Voter Export, or previous iterations of the Statewide Mail Ballot File.
All application types in this file are considered a type of mail ballot. Some of the applications are considered UOCAVA (Uniformed and Overseas Citizens Absentee Voting Act) or UMOVA (Uniform Military and Overseas Voters Act) ballots. These are listed below:
• CRI - Civilian - Remote/Isolated
• CVO - Civilian Overseas
• F - Federal (Unregistered)
• M - Military
• MRI - Military - Remote/Isolated
• V - Veteran
• BV - Bedridden Veteran
• BVRI - Bedridden Veteran - Remote/Isolated
*We may not have all application types in the file for every election.
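Following the note on 'PEND' codes above, a minimal sketch of treating leftover pending codes as cancelled; the file name and the "ballot_status" column name are assumptions about the extract schema.

```python
# Sketch: treat leftover 'PEND' status codes as cancelled, per the dataset notes.
# File and column names are assumptions; adjust to the actual extract.
import pandas as pd

ballots = pd.read_csv("mail_ballot_requests_2025_primary.csv")
pend_mask = ballots["ballot_status"].str.startswith("PEND", na=False)
ballots.loc[pend_mask, "ballot_status"] = "CANC - OTHER"  # or another cancel code
print(ballots["ballot_status"].value_counts())
```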
ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
NEW!: Use the new Business Account Number lookup tool.

SUMMARY
This dataset includes the locations of businesses that pay taxes to the City and County of San Francisco. Each registered business may have multiple locations and each location is a single row. The Treasurer & Tax Collector’s Office collects this data through business registration applications, account update/closure forms, and taxpayer filings. Business locations marked as “Administratively Closed” have not filed or communicated with TTX for 3 years, or were marked as closed following a notification from another City and County Department. The data is collected to help enforce the Business and Tax Regulations Code including, but not limited to: Article 6, Article 12, Article 12-A, and Article 12-A-1. http://sftreasurer.org/registration

HOW TO USE THIS DATASET
System migration in 2014: When the City transitioned to a new system in 2014, only active business accounts were migrated. As a result, any businesses that had already closed by that point were not included in the current dataset.
2018 account cleanup: In 2018, TTX did a major cleanup of dormant and unresponsive accounts and closed approximately 40,000 inactive businesses.
To learn more about using this dataset watch this video. To update your listing or look up your BAN see this FAQ: Registered Business Locations Explainer.
Data pushed to ArcGIS Online on November 10, 2025 at 6:16 AM by SFGIS. Data from: https://data.sfgov.org/d/g8m3-pdis

Description of dataset columns:
UniqueID
Unique formula: @Value(ttxid)-@Value(certificate_number)
Business Account Number
Seven digit number assigned to registered business accounts
Location Id
Location identifier
Ownership Name
Business owner(s) name
DBA Name
Doing Business As Name or Location Name
Street Address
Business location street address
City
Business location city
State
Business location state
Source Zipcode
Business location zip code
Business Start Date
Start date of the business
Business End Date
End date of the business
Location Start Date
Start date at the location
Location End Date
End date at the location, if closed
Administratively Closed
Business locations marked as “Administratively Closed” have not filed or communicated with TTX for 3 years, or were marked as closed following a notification from another City and County Department.
Mail Address
Address for mailing
Mail City
Mailing address city
Mail State
Mailing address state
Mail Zipcode
Mailing address zipcode
NAICS Code
The North American Industry Classification System (NAICS) is a standard used by Federal statistical agencies for the purpose of collecting, analyzing and publishing statistical data related to the U.S. business economy. A subset of these are options on the business registration form used in the administration of the City and County's tax code. The registrant indicates the business activity on the City and County's tax registration forms.
See NAICS Codes tab in the attached data dictionary under About > Attachments.
NAICS Code Description
The Business Activity that the NAICS code maps onto ("Multiple" if there are multiple codes indicated for the business).
NAICS Code Descriptions List
A list of all NAICS code descriptions separated by semi-colon
LIC Code
The LIC code of the business, if multiple, separated by spaces
LIC Code Description
The LIC code description ("Multiple" if there are multiple codes for a business)
LIC Code Descriptions List
A list of all LIC code descriptions separated by semi-colon
Parking Tax
Whether or not this business pays the parking tax
Transient Occupancy Tax
Whether or not this business pays the transient occupancy tax
Business Location
The latitude and longitude of the business location for mapping purposes.
Business Corridor
The Business Corridor in which the business location falls, if it is in one. Not all business locations are in a corridor.
Boundary reference: https://data.sfgov.org/d/h7xa-2xwk
Neighborhoods - Analysis Boundaries
The Analysis Neighborhood in which the business location falls. Not applicable outside of San Francisco.
Boundary reference: https://data.sfgov.org/d/p5b7-5n3h
Supervisor District
The Supervisor District in which the business location falls. Not applicable outside of San Francisco. Boundary reference: https://data.sfgov.org/d/xz9b-wyfc
Community Benefit District
The Community Benefit District in which the business location falls. Not applicable outside of San Francisco. Boundary reference: https://data.sfgov.org/d/c28a-f6gs
data_as_of
Timestamp the data was updated in the source system
data_loaded_at
Timestamp the data was loaded here (open data portal)
SF Find Neighborhoods
This column was automatically created in order to record in what polygon from the dataset 'SF Find Neighborhoods' (6qbp-sg9q) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Current Police Districts
This column was automatically created in order to record in what polygon from the dataset 'Current Police Districts' (qgnn-b9vv) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Current Supervisor Districts
This column was automatically created in order to record in what polygon from the dataset 'Current Supervisor Districts' (26cr-cadq) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Analysis Neighborhoods
This column was automatically created in order to record in what polygon from the dataset 'Analysis Neighborhoods' (ajp5-b2md) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Neighborhoods
This column was automatically created in order to record in what polygon from the dataset 'Neighborhoods' (jwn9-ihcz) the point in column 'location' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Note: If no description was provided by DataSF, the cell is left blank. See the source data for more information.
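A minimal sketch of reading this dataset programmatically. The SODA CSV endpoint pattern for dataset g8m3-pdis and the $limit parameter follow the standard Socrata API and should be verified against the portal documentation.

```python
# Sketch: read the Registered Business Locations extract via the data.sfgov.org
# Socrata endpoint for dataset g8m3-pdis (endpoint pattern assumed from the SODA API).
import pandas as pd

url = "https://data.sfgov.org/resource/g8m3-pdis.csv?$limit=5000"
businesses = pd.read_csv(url)
print(businesses.columns.tolist())
print(businesses.head())
```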
https://data.gov.sg/open-data-licence
Dataset from National Library Board. For more information, visit https://data.gov.sg/datasets/d_434d294555cbb371da63e9770d5b4ca1/view
Dataset Card for Custom Text Dataset
Dataset Name
Custom Text Dataset
Overview
This dataset contains text data for training summarization models. The data is collected from CNN/daily mail.
Composition
Number of records: 100
Fields: text, label
Collection Process
CNN/daily mail
Preprocessing
nothing
How to Use
from datasets import load_dataset
dataset = load_dataset("path_to_dataset")
for example in… See the full description on the dataset page: https://huggingface.co/datasets/rasauq1122/custom_summarization_dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This Extended Golf Play Dataset is a rich and detailed collection designed to extend the classic golf dataset. It includes a variety of features to cover many aspects of data science. This dataset is especially useful for teaching because it offers many small datasets within it, each one created for a different learning purpose.
This dataset includes a special set of mini datasets:
- Each mini dataset focuses on a specific teaching point, like how to clean data or how to combine datasets.
- They're perfect for beginners to practice with real examples.
- Along with these datasets, you'll find notebooks with step-by-step guides that show you how to use the data.
Students can use this dataset to learn many skills:
- Seeing Data: Learn how to make graphs and see patterns.
- Sorting Data: Find out which data helps to predict if golf will be played (see the sketch below).
- Finding Odd Data: Spot data that doesn't look right.
- Understanding Data Over Time: Look at how things change day by day or month by month.
- Grouping Data: Learn how to put similar days together.
- Learning From Text: Use players' reviews to get more insights.
- Making Recommendations: Suggest the best time to play golf based on past data.
This dataset is for everyone:
- New Learners: It's easy to understand and has guides to help you learn.
- Teachers: Great for classes on how to see and understand data.
- Researchers: Good for testing new ways to analyze data.
This dataset can be shared and used by anyone under the Creative Commons Attribution 4.0 International License (CC BY 4.0). (Illustrations are AI-generated).
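A minimal sketch of the "Sorting Data" exercise referenced above, predicting whether golf is played. The file name and column names (Outlook, Temperature, Humidity, Wind, Play) follow the classic golf dataset and are assumptions about this extended version.

```python
# Sketch: predict whether golf is played from weather features.
# File and column names are assumptions based on the classic golf dataset.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

golf = pd.read_csv("golf_extended.csv")  # hypothetical file name
X = pd.get_dummies(golf[["Outlook", "Temperature", "Humidity", "Wind"]])
y = golf["Play"]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())
```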
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset provides daily stock data for some of the top companies in the USA stock market, including major players like Apple, Microsoft, Amazon, Tesla, and others. The data is collected from Yahoo Finance, covering each company’s historical data from its starting date until today. This comprehensive dataset enables in-depth analysis of key financial indicators and stock trends for each company, making it valuable for multiple applications.
The dataset contains the following columns, consistent across all companies:
Potential use cases include Machine Learning & Deep Learning, Data Science, Data Analysis, and Financial Research.
This dataset is a powerful tool for analysts, researchers, and financial enthusiasts, offering versatility across multiple domains from stock analysis to algorithmic trading models.
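Since the data originates from Yahoo Finance, a comparable pull can be sketched with the yfinance package; the tickers below are examples taken from the description, not the dataset's full list.

```python
# Sketch: download daily data for a few of the listed companies from Yahoo Finance.
import yfinance as yf

data = yf.download(["AAPL", "MSFT", "AMZN", "TSLA"], start="2015-01-01", auto_adjust=True)
print(data["Close"].tail())
```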
https://creativecommons.org/publicdomain/zero/1.0/
ALL FILES ARE LOCATED AT MY REPOSITORY: https://github.com/christianio123/TexasAttendance
I was curious about factors affecting school attendance so I gathered data from school districts around Texas to have a better idea.
The purpose of the project is to help determine factors associated with student attendance in the state of Texas. No specific population is targeted as an audience for the project; however, anyone involved in education may find the dataset used (and other data obtained but not used) helpful for any questions they may have regarding student attendance in Texas for the first two months of the 2020-2021 academic school year. This topic was targeted specifically due to the abnormalities in the current academic school year.
The majority of the data in this project was collected from school districts around the state of Texas, public census information, and public COVID-19 data. To obtain student attendance information, an email was sent to 40 school districts around the state of Texas on November 2nd, 2020, citing the Freedom of Information Act (FOIA). Of those districts, 19 responded with the requested data, while other districts required purchase of the data due to the number of labor hours involved. Due to ambiguity in the original message sent to districts, varying types of data were collected. The major difference in the data received was between "daily" records of student attendance and a "summary" of student attendance records so far this academic school year. School districts took between 10 and 15 business days to respond, not including the holidays. The focus of this project is "daily student attendance", in order to find relationships or any influences from external or internal factors on any given school day. Therefore, of the 19 school districts that responded, 11 sent the appropriate data.
The 11 school districts that sent data were (1) Conroe ISD, (2) Cypress-Fairbanks ISD, (3) Floydada ISD, (4) Fort Worth ISD, (5) Pasadena ISD, (6) Snook ISD, (7) Socorro ISD, (8) Klein ISD, (9) Garland ISD, (10) Dallas ISD, and (11) Katy ISD. However, even within these datasets there were discrepancies: three school districts sent daily attendance data that included student grade level, but one school district did not include any other information. Also, of the 11 school districts, nine included student attendance broken down by school, while three others only had student attendance with no other attributes. This information is important for explaining certain steps in analysis preparation later. Variables used from the school district datasets included (a) dates, (b) weekdays, (c) school name, (d) school type, (e) district, and (f) grade level.
In addition to daily student attendance data, two other datasets from the Texas Education Agency were used, with data about each school and school district. The first dataset, "Current Schools", gives information about each school in the state of Texas, such as address, principal, county name, district number, and much more, as of May 2020. From this dataset, the selected variables were (a) school name, (b) school zip, (c) district number, and (d) school type. The second dataset, "District Type", gives attributes of each school district, such as whether the district is considered major urban, independent town, or a rural area. From the "District Type" dataset, the selected variables were (a) district, (b) district number, (c) Texas Education Agency (TEA) description, and (d) National Center for Education Statistics (NCES) description. To determine whether a county is metropolitan or non-metropolitan, a dataset from the Texas Health and Human Services was used; its selected variables were (a) county name and (b) metro area.
Student attendance has been noticeably different this academic school year, so live COVID-19 data was obtained from the New York Times to examine any relationship. This dataset is updated daily, with data available in three formats (country, state, and county). From this dataset, the variables selected were COVID-19 cases by state and by county.
Each school has a unique student population; therefore, census data from 2018 (the best available estimate of today's population) was used to find the makeup of the population surrounding each school by zip code. From the census data, the variables selected were zip code, race/ethnicity, median income, unemployment rate, and education. These variables were selected to determine differences in school attendance based on the makeup of the population surrounding the school.
Weather seems to have an impact on student attendance at schools, so weather data has been included based on county measures.
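A minimal sketch of the zip-code join described above, attaching census attributes to attendance records; all file and column names are assumptions.

```python
# Sketch: join daily attendance with census attributes by zip code.
# File and column names are assumptions about how the project's extracts are stored.
import pandas as pd

attendance = pd.read_csv("daily_attendance.csv")   # hypothetical extract
census = pd.read_csv("census_2018_by_zip.csv")     # hypothetical extract

merged = attendance.merge(census, left_on="school_zip", right_on="zip_code", how="left")
print(merged.head())
```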
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Welcome to the Google Places Comprehensive Business Dataset! This dataset has been meticulously scraped from Google Maps and presents extensive information about businesses across several countries. Each entry in the dataset provides detailed insights into business operations, location specifics, customer interactions, and much more, making it an invaluable resource for data analysts and scientists looking to explore business trends, geographic data analysis, or consumer behaviour patterns.
This dataset is ideal for a variety of analytical projects, including:
- Market Analysis: Understand business distribution and popularity across different regions.
- Customer Sentiment Analysis: Explore relationships between customer ratings and business characteristics.
- Temporal Trend Analysis: Analyze patterns of business activity throughout the week.
- Geospatial Analysis: Integrate with mapping software to visualise business distribution or cluster businesses based on location.
The dataset contains 46 columns, providing a thorough profile for each listed business. Key columns include:
- business_id: A unique Google Places identifier for each business, ensuring distinct entries.
- phone_number: The contact number associated with the business. It provides a direct means of communication.
- name: The official name of the business as listed on Google Maps.
- full_address: The complete postal address of the business, including locality and geographic details.
- latitude: The geographic latitude coordinate of the business location, useful for mapping and spatial analysis.
- longitude: The geographic longitude coordinate of the business location.
- review_count: The total number of reviews the business has received on Google Maps.
- rating: The average user rating out of 5 for the business, reflecting customer satisfaction.
- timezone: The world timezone the business is located in, important for temporal analysis.
- website: The official website URL of the business, providing further information and contact options.
- category: The category or type of service the business provides, such as restaurant, museum, etc.
- claim_status: Indicates whether the business listing has been claimed by the owner on Google Maps.
- plus_code: A sho...
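A minimal sketch of a simple aggregation over the columns documented above; the file name is an assumption.

```python
# Sketch: average rating and review volume per business category.
# The CSV file name is an assumption; column names follow the description above.
import pandas as pd

places = pd.read_csv("google_places_businesses.csv")
summary = (places.groupby("category")
                 .agg(avg_rating=("rating", "mean"),
                      total_reviews=("review_count", "sum"),
                      businesses=("business_id", "count"))
                 .sort_values("avg_rating", ascending=False))
print(summary.head(10))
```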
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides a comprehensive corpus for natural language processing tasks, specifically text summarization and the validation of OpenAI's reward models. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with additional information including the choices workers made when comparing summaries, batch information that differentiates the summaries produced by workers, and train/validation split attributes. This data allows users to train state-of-the-art natural language processing systems on real-world data to create reliable, concise summaries of long-form text, and to benchmark model output directly against human-generated results.
This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.
To use this dataset for summarization tasks:
- Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
- Choose which summary you want from the choice column of either .csv file, based on your preference for worker or batch type summarization.
- Review entries in the selected summary's corresponding summaries column for alternative options with similar content but different word choices/styles that you may prefer over the original choice.
- Look through the split, worker, and batch information for more detail on each choice before selecting one as your desired summary, according to the accuracy or clarity of its content.
- Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
- Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
- Analyzing the worker and batch information, in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: comparisons_validation.csv

| Column name | Description |
|:------------|:------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split | Split of the dataset between training and validation sets. (String) |
| extra | Additional information about the given source material available. (String) |
File: comparisons_train.csv

| Column name | Description |
|:------------|:------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split ...
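A minimal sketch of loading the two files and inspecting one comparison, using the columns documented above.

```python
# Sketch: load the comparison files and print one chosen summary.
import pandas as pd

train = pd.read_csv("comparisons_train.csv")
validation = pd.read_csv("comparisons_validation.csv")

row = train.iloc[0]
print("Source text:", str(row["info"])[:300])
print("Candidate summaries:", row["summaries"])
print("Chosen summary:", row["choice"])
```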