By Huggingface Hub [source]
This dataset provides 69,000 instances of natural language processing (NLP) editing tasks to help researchers develop more effective AI text-editing models. Compiled into a convenient JSON format, this collection offers easy access so that researchers have the tools they need to create groundbreaking AI models that efficiently and effectively redefine natural language processing. This is your chance to be at the forefront of NLP technology and make history through innovative AI capabilities. So join in and unlock a world of possibilities with CoEdIT's Text Editing Dataset!
- Familiarize yourself with the format of the dataset by looking at its columns: task, src, and tgt. Each row contains a specific NLP editing task, the source text (src), and the target text (tgt) showing the expected result of that edit.
- Import the dataset file into your machine-learning environment or analysis tool of choice. Popular options include Python's Pandas library, BigQuery on Google Cloud Platform for larger datasets like this one, or a spreadsheet application such as Excel.
- Once the data is imported, start exploring. Browse a sample of rows to see how different kinds of edits transform the source text into target text that meets the given criteria, and read any documentation associated with each column to understand the context before beginning your analysis or coding.
- Prototype code that processes different types and scales of edits. For example, if understanding how punctuation affects sentence-similarity measures gives key insight into the meaning being conveyed, develop code accordingly, experimenting with common ML/NLP libraries such as NLTK. A minimal loading sketch follows this list.
- Finally, once you have tested your conceptual ideas, build efficient and effective AI-powered models using training data tailored to the task at hand, and evaluate performance on validation and test sets before moving to production.
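A minimal loading sketch with pandas, assuming the training split is available as the train.csv described further below (the usage notes call the columns task, src, and tgt, while the file listing capitalizes the first as Task, so the sketch normalizes the names):

```python
import pandas as pd

# Load the training split and normalize column-name casing just in case.
df = pd.read_csv("train.csv")
df.columns = [c.lower() for c in df.columns]

print(df.shape)
print(df["task"].value_counts())        # which editing tasks are present
print(df[["task", "src", "tgt"]].head())
```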
- Automated Grammar Checking Solutions: This dataset can be used to train machine learning models to detect grammatical errors and suggest proper corrections.
- Text Summarization: Using this dataset, researchers can create AI-powered summarization algorithms that summarize long-form passages into shorter summaries while preserving accuracy and readability
- Natural Language Generation: This dataset could be used to develop AI solutions that generate accurately formatted natural language sentences when given a prompt or some other form of input
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------------|
| Task | This column describes the task that the dataset is intended to be used for. (String) |
| src | This column contains the source text input. (String) |
| tgt | This column contains the target text output. (String) |

File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------------|
| Task | This column describes the task that the dataset is intended to be used for. (String) |
| src | This column contains the source text input. (String) |
| tgt | This column contains the target text output. (String) |
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames to preserve users' anonymity and to prevent targeted harassment
To use this dataset, you will need a spreadsheet application or text editor capable of opening CSV files (for example LibreOffice Calc or a plain-text editor), plus a web browser such as Google Chrome or Mozilla Firefox if you want to follow the permalinks.
Once you have the necessary software installed, open the Reddit Dataset folder and open the-reddit-dataset-dataset-posts.csv in your preferred application.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in the datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on post titles to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes; a minimal sketch of this kind of analysis follows.
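A minimal analysis sketch with pandas, using the column names from the posts file listing below (created_utc is assumed to be a Unix epoch; adjust if your copy stores ISO timestamps):

```python
import pandas as pd

# Load the posts file (columns per the file listing below).
posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")

# Overall average score and per-sentiment averages.
print("Average score:", posts["score"].mean())
print(posts.groupby("sentiment")["score"].agg(["mean", "count"]))

# Posting volume per month.
created = pd.to_datetime(posts["created_utc"], unit="s")
print(created.dt.to_period("M").value_counts().sort_index().tail(12))
```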
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiment of posts and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| body | The body of the post. (String) |
| sentiment | The sentiment of the post. (String) |
| score | The score of the post. (Integer) |

File: the-reddit-dataset-dataset-posts.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| score | The score of the post. (Integer) |
| domain | The domain of the post. (String) |
| url | The URL of the post. (String) |
| selftext | The self-text of the post. (String) |
| title | The title of the post. (String) |
If you use this dataset in your research, please credit SocialGrep.
Weekly updated dataset with the latest version of the Numerai tournament data. The dataset contains a directory named after the latest data version, currently V5.0. The data are downloaded weekly by the public Kaggle notebook numerai data whenever new data become available (at the opening of the Saturday round). Whenever that notebook's output changes, this dataset is automatically updated, so you can add it to your notebooks as a data source (or as the output of the numerai data notebook) without downloading the files yourself.
Older versions of data are available elsewhere: * V4 and V4.1 - dataset and producing notebook * V4.2 Rain - dataset and producing notebook * V4.3 Midnight - dataset and producing notebook
The text file current_round.txt contains the number of the tournament round in which the data were last successfully downloaded.
In addition to all data files provided by Numerai, the downloading notebook creates four partitions of non-overlapping eras from the training and validation data. These are stored as f"train_no{split}.parquet" and f"validation_no{split}.parquet" files. Since Round 864 the polars library is used to produce the downsampled files. Because polars does not use an index, the saved files store the index id as an ordinary column. If you need the same index as in the original files, add the following check to your code right after df = pandas.read_parquet(filename):
# Restore the "id" index if the parquet file stored it as an ordinary column.
if "id" not in df.index.names:
    df.set_index("id", inplace=True)
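For convenience, the check can be wrapped in a small helper. The directory name and split index below are assumptions based on the description above (version V5.0, splits numbered from 0):

```python
import pandas as pd

def load_split(filename: str) -> pd.DataFrame:
    """Load a downsampled parquet split and restore the 'id' index if needed."""
    df = pd.read_parquet(filename)
    if "id" not in df.index.names:
        df.set_index("id", inplace=True)
    return df

# Example: first training partition from the latest data version (path assumed).
train_0 = load_split("v5.0/train_no0.parquet")
print(train_0.shape)
```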
This dataset is important as it can help users find good-quality videos more easily. The data was collected using the YouTube API and includes a total of _ videos.
Columns: Channel title, view count, like count, comment count, definition, caption, subscribers, total views, average polarity score, label
To use this dataset, you will need the following: a YouTube API key and a text editor (e.g. Notepad++, Sublime Text).
Once you have collected these items, you can begin using the dataset. Here is a step-by-step guide (a minimal analysis sketch follows these steps):
- Navigate to the folder where you saved the dataset.
- Right-click on the file and select Open with > your text editor.
- Copy your YouTube API key and paste it in place of Your_API_Key in line 4 of the code.
- Save the file and close your text editor.
- Navigate to the folder in your terminal/command prompt and type jupyter notebook. This will open a Jupyter Notebook in your browser window.
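Once the notebook is open, a minimal exploration sketch like the following could be used to rank channels by engagement. The column names are taken from the dataframeclean.csv listing further below and should be treated as assumptions to verify against your copy of the file:

```python
import pandas as pd

# Column names follow the dataframeclean.csv listing below; adjust if they differ.
df = pd.read_csv("dataframeclean.csv")

# Simple engagement ratios as a rough proxy for video quality.
df["like_ratio"] = df["likeCount"] / df["viewCount"].clip(lower=1)
df["comment_ratio"] = df["commentCount"] / df["viewCount"].clip(lower=1)

# Channels ranked by average like ratio.
print(
    df.groupby("channelTitle")[["like_ratio", "comment_ratio"]]
      .mean()
      .sort_values("like_ratio", ascending=False)
      .head(10)
)
```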
This dataset can be used for a number of different things, including:
- Finding good-quality videos on YouTube
- Determining which videos are more likely to be reputable
- Helping people find videos they will enjoy
The data for this dataset was collected using the YouTube API and includes a total of _ videos.
See the dataset description for more information.
File: dataframeclean.csv

| Column name | Description |
|:-------------------|:------------|
| **** | |
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| pushblishYear | |
| durationSecs | |
| tagCount | |
| title length | |
| description length | |

File: ytdataframe.csv

| Column name | Description |
|:-------------------|:-------------------------------------------------------------|
| **** | |
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| favouriteCount | The number of times the video has been favorited. (Integer) |
| duration | The length of the video in seconds. (Integer) |

File: ytdataframe2.csv

| Column name | Description |
|:-------------------|:-------------------------------------------------------------|
| **** | |
| channelTitle | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| viewCount | |
| ... | |
By Environmental Data [source]
Do you want to know how rising temperatures are changing the contiguous United States? The Washington Post has used National Oceanic and Atmospheric Administration's Climate Divisional Database (nClimDiv) and Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nClimGrid) data sets to help analyze warming temperatures in all of the Lower 48 states from 1895-2019. To provide this analysis, we calculated annual mean temperature trends in each state and county in the Lower 48 states. Our results can be found within several datasets now available on this repository.
We are offering: annual average temperatures for counties and states, temperature change estimates for each of the Lower 48 states, temperature change estimates for counties in the contiguous U.S., county temperature change data joined to a shapefile in GeoJSON format, and gridded temperature change data for the contiguous U.S. in GeoTIFF format, all contained within this dataset! We invite those curious about climate change to explore these data sets, which underpin our analysis across multiple stories published by The Washington Post, such as Extreme climate change has arrived in America; Fires, floods and free parking: California's unending fight against climate change; In fast-warming Minnesota, scientists are trying to plant the forests of the future; This giant climate hot spot is robbing the West of its water; and more!
By accessing our dataset, which contains columns such as the FIPS code, year (covering 1895-2019), seasonal temperatures (fall/spring/summer/winter), the maximum-warming season, and total yearly temperature, you can become an active citizen scientist! If you publish a story or graphic based on this data set, please credit The Washington Post with a link back to this repository, and send us an email so that we can track its usage as well (2cdatawashpost.com).
The main files provided by this dataset are climdiv_state_year, climdiv_county_year, model_state, model_county, climdiv_national_year, and model_county.geojson. Each file captures climate change across different geographies of the United States over time spans beginning in 1895.
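A minimal sketch for estimating per-state warming trends with pandas and NumPy, using the climdiv_state_year.csv columns listed further below (fips, year, tempc):

```python
import pandas as pd
import numpy as np

# Columns per the climdiv_state_year.csv listing below: fips, year, tempc.
state = pd.read_csv("climdiv_state_year.csv")

def warming_slope(g: pd.DataFrame) -> float:
    """Least-squares slope of temperature against year (degrees per year)."""
    slope, _intercept = np.polyfit(g["year"], g["tempc"], 1)
    return slope

trends = (
    state.groupby("fips")[["year", "tempc"]]
    .apply(warming_slope)
    .sort_values(ascending=False)
)
print(trends.head(10))   # fastest-warming state FIPS codes
```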
- Investigating and mapping the temperatures for all US states over the past 120 years, to observe long-term changes in temperature patterns.
- Examining regional biases in warming trends across different US counties and states to help inform resource allocation decisions for climate change mitigation and adaptation initiatives.
- Utilizing the ClimDiv National Dataset to understand continental-level average annual temperature changes, allowing comparison of global average temperatures with US averages over a long period of time
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: climdiv_state_year.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| fips | Federal Information Processing Standard code for each county. (Integer) |
| year | Year of the temperature data. (Integer) |
| tempc | Temperature change from the previous year. (Float) |

File: climdiv_county_year.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| fips | Federal Information Processing Standard code for each county. (Integer) |
| year | Year of the temperature data. (Integer) |
| tempc | Temperature change from the previous year. (Float) |
File: model_state.csv

| Column name | Description |
|:------------|:------------|
| ... | |
By [source]
This dataset allows readers to unlock hidden insights into contemporary literature and the books that people are choosing to purchase. It provides comprehensive and powerful data related to a web books retailer, books.toscrape.com, featuring 12 columns of crucial book metadata gathered through web scraping methods in November 2020. Researching publications through this information provides a great sense of insight and understanding into the current reading climate: uncovering emerging trends in what people are buying, reading, rating, and loving worldwide. With this dataset at your disposal you can explore book popularity from a commercial standpoint as well as a creative one; examining publishing preferences from authors' points of view across reviews and genres alike. Dive into discovering the secrets behind book selection habits by delving into topics ranging from rating systems for certain works to pricing analysis for publishers- all fuelled by this carefully organised streamline of data at play here today!
To get started analyzing this dataset with Kaggle notebooks or other tools:
- Open your tool of choice (a Kaggle notebook or anything else that can read CSV files).
- Import the dataset.csv file into your chosen program.
- Explore each column individually to understand what type of book metadata it holds: title, image URL, rating, number of reviews, description, and more.
- Once familiar with each column, explore correlations between them to deepen your understanding of trends among different types of books, broken down by category.
- Lastly, use third-party packages available in your chosen programming language (e.g., Pandas) to continue with deeper analysis; a minimal loading sketch follows this list.
By following these steps you are ready to start exploring literature insights within this book metadata that may otherwise have gone undiscovered!
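A minimal loading sketch with pandas. Per the column listing further below, the CSV header appears to contain values from the first book (a title, prices, and so on) rather than descriptive names, so the column names used here are assumptions taken from that listing; verify them against your copy of the file before relying on the results:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

# Inspect what the header row actually looks like before doing anything else.
print(df.columns.tolist())
print(df.head())

# Example: clean a price-like column (names taken from the listing below) and
# look at the average price per category. Adjust the names to your copy.
price_col, category_col = "£13.12", "Academic"
prices = df[price_col].astype(str).str.replace("£", "", regex=False).astype(float)
print(prices.groupby(df[category_col]).mean().sort_values(ascending=False).head())
```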
- Generating recommendations of books based on popularity, price point, and/or rating.
- Tracking the success of certain authors/publishers in the long term and understanding their audience preferences.
- Analysing which types of books consumers prefer (genre, age group targeting) over time to provide useful data to new authors to increase their chances of success
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dataset.csv

| Column name | Description |
|:-------------------------------------|:----------------------------------------------------|
| Logan Kade (Fallen Crest High #5.5) | Title of the book. (String) |
| https | Image URL of the book. (String) |
| Two | Rating of the book. (Integer) |
| Academic | Description Category of the book. (String) |
| 7093cf549cd2e7de | Universal Product Code (UPC) of the book. (String) |
| Books | Product Type of the book. (String) |
| £13.12 | Price Excluding Tax of the book. (Float) |
| £13.12.1 | Price Including Tax of the book. (Float) |
| £0.00 | Tax Amount of the book. (Float) |
| In stock (5 available) | Availability of the book. (String) |
If you use this dataset in your research, please credit the original authors.
Here is the dataset for classifying different classes of traffic signs. There are around 58 classes and each class has around 120 images. The labels.csv file has the respective description of each traffic sign class; you can change the assignment of these class IDs to descriptions. A basic CNN model is enough to get decent validation accuracy, and there are around 2,000 files for testing.
You can view the notebook named official in the code section to train and test a basic CNN model; a minimal Keras sketch is also included below.
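A minimal Keras sketch of such a baseline CNN. It assumes the training images are arranged in one sub-folder per class and are resized to 32x32; the directory name and image size are assumptions to adapt to your copy of the dataset:

```python
import tensorflow as tf

NUM_CLASSES = 58          # per the dataset description
IMG_SIZE = (32, 32)       # assumed resize target; adjust to your preprocessing

# Assumes one sub-folder per class under this (assumed) directory.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "traffic_Data/DATA", image_size=IMG_SIZE, batch_size=32,
    validation_split=0.2, subset="training", seed=42,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "traffic_Data/DATA", image_size=IMG_SIZE, batch_size=32,
    validation_split=0.2, subset="validation", seed=42,
)

# A small convolutional baseline: two conv/pool blocks and a dense head.
model = tf.keras.Sequential([
    tf.keras.Input(shape=IMG_SIZE + (3,)),
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```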
Please upvote the notebook and dataset if you like this.
By [source]
This dataset contains data on the digital usage habits and ICT skills of Finnish basic education teachers from 2017-2019. It includes valuable background information such as age, gender, postal code of place of employment, teacher types, and urbanization level. Furthermore, this dataset also includes variables that measure self-efficacy in digital skills; perceived adequacy of in-service training in digital skills; frequency with which these teachers use digital technologies; and a summative measure for identifying, retrieving, processing, and sharing information.
With this data set researchers can study the effects of existing programs aimed at enhancing teachers' technology usage in Finnish basic education, as well as explore the differences between demographic groups in ICT knowledge and activity levels. Connections between age groups' or geographical areas' digital literacy can be revealed by analyzing trends in the data presented here. Additionally, researchers will be able to gain insight into how urbanization affects ICT skill levels among teachers, and look into whether adequate training is being provided for keeping up with changing technologies in educational environments.
This comprehensive dataset is an incredibly valuable resource for those studying the role that technology has to play in our current educational systems
This dataset provides valuable insights into Finnish basic education teachers’ ICT skills and how they use digital technology in the classroom. Here are some tips to help you get the most out of this dataset:
- Analyze the demographic characteristics of teachers in Finland and identify any patterns or trends that may exist between teacher characteristics and their self-efficacy, training adequacy, and digital activity levels.
- Examine how age, gender, urbanization level (or lack thereof), teacher type, and information skills affect perceived digital competency levels among Finnish basic education teachers.
- Compare different types of training programs for Finnish basic education teachers to discern which are most effective at improving their ICT skills as well as their adoption of digital technologies in the classroom.
- Utilize this data to understand Finland’s approach to digital literacy education across geographic regions, with a particular focus on rural areas versus more urbanized zones where access to technology varies significantly.
- Finally, research how digital usage habits among Finnish basic educators may be changing over time by using data from multiple years within this dataset as a starting point for further investigation into trends in self-efficacy ratings or frequency/type of usage by year or season. A minimal exploration sketch follows this list.
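A minimal exploration sketch with pandas, using the column names from the Information_skills_teachers.csv listing further below:

```python
import pandas as pd

# Column names per the Information_skills_teachers.csv listing below.
teachers = pd.read_csv("Information_skills_teachers.csv")

# How does reported self-efficacy vary with urbanization level?
print(pd.crosstab(teachers["Urbanization_level"], teachers["Self_efficacy"],
                  normalize="index").round(2))

# Age distribution per perceived adequacy of in-service training.
print(teachers.groupby("Inservice_training")["Age"].describe())
```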
- Analyzing differences in digital skills, self-efficacy, and usage habits between different age groups and genders.
- Examining the relationship between urbanization level and teachers’ digital activity.
- Investigating how information technology skills can be used to enhance digital literacy in Finnish classrooms
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Information_skills_teachers.csv

| Column name | Description |
|:-------------------|:-----------------------------------------------------------------------------|
| Urbanization_level | Level of urbanization of the teacher's place of employment. (Categorical) |
| Age | Age of teacher. (Numerical) |
| Self_efficacy | Self-efficacy in digital skills. (Categorical) |
| Inservice_training | Perceived adequacy of in-service training in digital skills. (Categorical) |
| ... | |
By Huggingface Hub [source]
This Synthia-v1.3 dataset provides insight into the complexities of human-machine communication through its collection of dialogue interactions between humans and machines. It details how conversations develop between the two, including behavioural changes in both humans and machines towards one another over time. With information provided on user instructions to machines, the system, machine responses, and other related data points, this dataset offers a detailed overview of machine learning concepts, examining how systems use dialogue to interact with people in various scenarios. This can offer valuable insight into how predictive intelligence is applied by these systems in conversational settings, better informing developers seeking to build their own human-machine interfaces for effective two-way communication. Looking at this data set as a whole creates an understanding of the way connections form between humans and machines, providing a deeper appreciation of the ongoing challenges faced when working on projects with these technological components at play.
The dataset consists of a collection of dialogue interactions between humans and machines, providing insight into human-machine communication. It includes information about the system being used, instructions given by humans to machines and responses from machines.
To start using this dataset:
- Download the CSV file containing all of the dialogue interactions from the Kaggle datasets page.
- Open your favourite spreadsheet software, such as Excel or Google Sheets, and load the CSV file.
- Look at each of the columns to familiarize yourself with what they contain: the 'system' column describes the system used for the role play between human and machine; the 'instruction' column contains the instructions given by humans to machines; the 'response' column contains the machines' responses to those instructions.
- Start exploring how conversations progress between humans and machines over time by examining these columns separately or together as required. You can also filter for specific conditions, such as conversations driven entirely by particular systems or involving certain instruction types. In addition, you can conduct various kinds of analysis, such as descriptive statistics or correlation analysis; a minimal filtering sketch follows this list. With so many possibilities for exploration, you are sure to find something interesting!
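A minimal filtering sketch with pandas, using the column names from the train.csv listing further below; the search term is only an illustrative assumption:

```python
import pandas as pd

# Columns per the train.csv listing below: system, instruction, response.
df = pd.read_csv("train.csv")

# Which system prompts appear most often?
print(df["system"].value_counts().head())

# Filter conversations whose instruction mentions a particular topic (example term).
code_related = df[df["instruction"].str.contains("python", case=False, na=False)]
print(len(code_related), "instructions mention 'python'")
print(code_related[["instruction", "response"]].head())
```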
- Utilizing the dataset to understand how various types of instruction styles can influence conversation order and flow between humans and machines.
- Using the data to predict potential responses in a given dialogue interaction from varying sources, such as robots or virtual assistants.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-----------------------------------------------------------------|
| system | The type of system used in the dialogue interaction. (String) |
| instruction | The instruction given by the human to the machine. (String) |
| response | The response given by the machine to the human. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
By [source]
This dataset contains a treasure-trove of information on over 55 million open source Java files, providing technical debt-related insights that can be used to inform a range of research and analytical activities. Every file captured in the dataset is assigned an MD5-hash to ensure unique identification, along with key metrics including its technical debt probability, fan-in/fan-out levels, total methods & variables, lines of code & comment lines, and the number of occurrences recorded.
These data points can each provide important guidance into the magnitude and scope of technical debt in open source Java software development projects. Researchers can analyse correlations between their technical debt probability and levels of fan-in/fan-out as well as variables such as methods created & number of lines written. Meanwhile analysts are enabled to identify files with high impacts on code quality through comparing their joint location in both technical debt probability rankings and highest occurrence rankings.
Utilizing this comprehensive dataset opens up opportunities for a wide range of investigations into the complex relationships between software development practices and code quality. It presents an invaluable resource for anyone looking to gain key insights into this subject, turning questions into answers via exploration!
How to use this dataset:
The dataset contains several columns with different pieces of information, including file_md5 (a unique identifier for each file), td_probability (the probability that the file contains technical debt), fanin (the number of incoming dependencies for the file), fanout (the number of outgoing dependencies for the file), total methods and variables, and total lines of code and comment lines. Researchers or analysts may perform statistical analysis on these parameters to get an overall idea of the impact that these values have on code quality. Additionally, they may find correlations between values such as the fan-in/fan-out ratio and sums or averages of the methods/variables used in a particular set of files. Finally, they can look at the occurrences column, which records how many times a particular MD5 hash has been used in open source repositories; this could help identify particularly well-received files that have been widely reused across multiple platforms.
By examining these columns together you will be able to gain insight into trends related to technical debt in Open Source Java programs as well as identify key areas where there is potential danger/challenges associated with implementation within your own projects. With enough data manipulation you may even make predictions regarding future implementation based on past experiences!
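A minimal correlation sketch with pandas. The column names are taken from the description above and the listing below; since the file covers tens of millions of rows, only the columns needed for the check are loaded:

```python
import pandas as pd

# Column names per the description above; adjust if your copy differs.
cols = ["td_probability", "fanin", "fanout"]
td = pd.read_csv("TD_of_55M_files.csv", usecols=cols)

# How strongly does technical-debt probability co-vary with coupling metrics?
print(td.corr(numeric_only=True))

# Files with both high debt probability and very high fan-in are likely hot spots.
hot_spots = td[(td["td_probability"] > 0.9) & (td["fanin"] > td["fanin"].quantile(0.99))]
print(len(hot_spots), "potential hot spots")
```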
- Correlating technical debt probability and lines of code or variables to determine how additional code complexity impacts the magnitude of technical debt.
- Identifying files with a high probability of technical debt which have been used in multiple projects, so that those files may be improved to help future projects.
- Analyzing the average fan-in and fan-out for different programming paradigms, such as MVC, to determine if any design patterns produce higher degrees of technical debt than other paradigms or architectures
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: TD_of_55M_files.csv

| Column name | Description |
|:---------------|:-----------------------------------------------------------------------------------------------------|
| file_md5 | A unique identifier for each file that can also be used to track them across repositories or other sources. (String) |
| td_probability | The probability that the file contains technical debt. |
| ... | |
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
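Since the data ships as a CSV, a pandas equivalent of that query may be more convenient than SQL. The qtype value 'Treatment' is taken from the example above and may differ from the values actually present in the data:

```python
import pandas as pd

# Columns per the train.csv listing below: qtype, Question, Answer.
med = pd.read_csv("train.csv")

# pandas equivalent of:
#   SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
answers = med.loc[
    (med["qtype"] == "Treatment")
    & med["Question"].str.contains("pain", case=False, na=False),
    "Answer",
]
print(len(answers), "matching answers")
print(answers.head())
```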
Once you have obtained new insights about healthcare from the answers provided in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis, so look out for notifications if that happens.
Finally, once you are making an impact with your use case(s), don't forget proper citation etiquette: give credit where credit is due!
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:---------------------------------------------------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
By [source]
The FakeCovid dataset is an unparalleled compilation of 7,623 fact-checked news articles related to COVID-19. Obtained from 92 fact-checking websites located in 105 countries, this comprehensive collection covers a wide range of sources and languages, including locations across Africa, Europe, Asia, the Americas, and Oceania. With data gathered from references on Poynter and Snopes, this unique dataset is an invaluable resource for researching the accuracy of global news related to the pandemic. It offers insight into the international nature of COVID information, with columns covering the countries involved; categories such as coronavirus health updates or political interference during the pandemic; URLs for referenced articles; the verifiers employed by each website; article classes ranging from true to false or mixed evaluations; publication dates; article sources annotated with credibility verification; and article text with language standardization. This one-of-a-kind dataset serves as an essential tool for understanding global information flow around COVID-19 while also offering transparency into whose interests guide it.
The FakeCovid dataset is a multilingual cross-domain collection of 7623 fact-checked news articles related to COVID-19. It is collected from 92 fact-checking websites and covers a wide range of sources and countries, including locations in Africa, Asia, Europe, The Americas, and Oceania. This dataset can be used for research related to understanding the truth and accuracy of news sources related to COVID-19 in different countries and languages.
To use this dataset effectively, you will need basic knowledge of data-science principles and tooling such as pandas, NumPy, or scikit-learn. The data is in CSV (comma-separated values) format, which can be read by most spreadsheet applications or a text editor like Notepad++. Here are some steps to get started:
- Access the FakeCovid Fact Checked News Dataset from Kaggle: https://www.kaggle.com/c/fakecovidfactcheckednewsdataset/data
- Download the provided CSV file containing all fact-checked news articles and place it in your desired folder.
- Load the CSV file into your preferred environment, such as a Jupyter Notebook or RStudio.
- Explore the dataset using built-in functions from data-science libraries such as pandas and matplotlib: find meaningful information through statistical analysis and/or create visualizations (a minimal exploration sketch follows below).
- Modify parameters within the CSV file if required and save your changes.
- Share your creative projects through the Gitter chatroom #fakecovidauthors.
- Publish any interesting discoveries you make in open-source repositories like GitHub.
- Engage with the Hangouts group #FakeCoviDFactCheckersClub.
- Show off fun graphics via the Twitter hashtag #FakeCovidiauthors.
- Reach out with further questions via email (contactfakecovidadatateam).
- Stay connected by joining the mailing list #FakeCoviDAuthorsGroup.
We hope this guide helps you better understand how to use our FakeCoviD Fact Checked News Dataset for generating meaningful insights relating to COVID-19 news articles worldwide!
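A minimal exploration sketch with pandas. The file name and the column names used here (country, class, lang) are assumptions based on the description above; the sketch prints the real header first so you can adjust accordingly:

```python
import pandas as pd

# Assumed file name; use the CSV you actually downloaded from Kaggle.
df = pd.read_csv("FakeCovid_July2020.csv")

print(df.columns.tolist())          # confirm the real column names first

# Assumed column names based on the description above; skipped if absent.
for col in ("country", "class", "lang"):
    if col in df.columns:
        print(f"\nTop values for '{col}':")
        print(df[col].value_counts().head(10))
```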
- Developing an automated algorithm to detect fake news related to COVID-19 by leveraging the fact-checking flags and other results included in this dataset for machine learning and natural language processing tasks.
- Training a sentiment analysis model on the data to categorize articles by sentiment, which can be used for further investigation into why certain news topics or countries show certain outcomes, motivations, or behaviors due to their content relatedness or author bias (if any).
- Using unsupervised clustering techniques, this dataset could be used as a tool for identifying discrepancies between news circulated among different populations in different countries (languages and regions), so that publicists can focus on providing factual information rather than spreading false rumors or misinformation about the pandemic.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Normally you need to download another notebook's results and then re-upload them if you want to use them within your own notebook. So I created this dataset for anyone who wants to use notebook results directly, without the download/upload step. Please upvote if it helps you.
This dataset contains 5 results used as input for a hybrid approach in these notebooks: * https://www.kaggle.com/titericz/h-m-ensembling-how-to/notebook * https://www.kaggle.com/code/atulverma/h-m-ensembling-with-lstm
If you want to use those notebooks but can't access a private dataset, please add my dataset to your notebook, then change the file paths.
It has 5 files (a minimal loading sketch follows the list):
* submissio_byfone_chris.csv: Submission result from: https://www.kaggle.com/lichtlab/0-0226-byfone-chris-combination-approach
* submission_exponential_decay.csv: Submission result from: https://www.kaggle.com/tarique7/hnm-exponential-decay-with-alternate-items/notebook
* submission_trending.csv: Submission result from: https://www.kaggle.com/lunapandachan/h-m-trending-products-weekly-add-test/notebook
* submission_sequential_model.csv: Submission result from: https://www.kaggle.com/code/astrung/sequential-model-fixed-missing-last-item/notebook
* submission_sequential_with_item_feature.csv: Submission result from: https://www.kaggle.com/code/astrung/lstm-model-with-item-infor-fix-missing-last-item/notebook
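A minimal loading sketch, assuming the dataset is attached to a Kaggle notebook and mounted under an input directory; the folder name below depends on the dataset slug and is an assumption to adjust:

```python
import pandas as pd
from pathlib import Path

# Adjust this to the folder Kaggle mounts for the attached dataset.
base = Path("/kaggle/input/hm-notebook-results")

files = [
    "submissio_byfone_chris.csv",
    "submission_exponential_decay.csv",
    "submission_trending.csv",
    "submission_sequential_model.csv",
    "submission_sequential_with_item_feature.csv",
]

# Load each submission file and report its shape.
submissions = {name: pd.read_csv(base / name) for name in files}
for name, df in submissions.items():
    print(name, df.shape)
```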
By [source]
The ACMUS YouTube Music Set is an annotated collection of music from YouTube videos, designed to support the exploration of cutting-edge computational methods for two key tasks: instrumental format identification and vocal music classification. Encompassing a wide range of genres and eras, this multi-dimensional dataset contains information such as File Name, Title, Genre, Composer or Artist?, Sampling Rate, Channels, Bit Depth, Duration (sec), Original File (if applicable), the Collection from which it was taken, Observations made about the audio file (if any), Number of Instruments present, presence or absence of Guitar/Bandola/Tiple/Bass/Percussion, Tempo, and Language. Additionally, this dataset goes one step further by including vocal classification based on the presence or absence of a Female Voice or Male Voice. This is a great resource for anyone exploring artificial intelligence techniques related to music recognition and vocal classification.
- Review the columns of information included in the dataset: File name, Title, Genre, Composer or artist?, Sampling rate, Channels, Bit depth, Duration (sec), Original file, Collection, Observations, Nr. of instruments, Guitar, Bandola, Tiple, Bass, Percussion, Female voice, Male voice, Tempo, Language, and Artist/Performer.
- Start by exploring the audio file properties first: sampling rate, channels, bit depth, and duration, together with the collection each file was taken from.
- Make sure you have a clear understanding of each column before you proceed, including the annotation features such as genre, number of instruments, the presence of individual instruments, and the presence of a female or male voice.
- Establish relationships between different data points by using visualization tools such as graphs, tables, and scatter plots: visualize related audio-file properties like genre, channel composition, artist names, original files, collections, and observations, identifying cover versions, instrumental versions, and so on.
- Update your research regularly with new findings by revisiting your visualizations, comparing features between different formats, and running clustering algorithms to better group music files. A minimal exploration sketch follows this list.
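A minimal exploration sketch with pandas, assuming the annotations ship as a CSV with the columns named above; the file name is an assumption, and the sketch prints the real header and skips any columns that are missing:

```python
import pandas as pd

# Assumed file name; replace with the actual annotation file in the dataset.
meta = pd.read_csv("acmus_annotations.csv")
print(meta.columns.tolist())

# Distribution of genres and of vocal annotations, using the column names
# quoted in the steps above (adjust if your header differs).
if "Genre" in meta.columns:
    print(meta["Genre"].value_counts().head(10))
for col in ("Female Voice", "Male Voice"):
    if col in meta.columns:
        print(f"\n{col}:")
        print(meta[col].value_counts(dropna=False))
```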
- Using the Instrumental Format Recognition and Vocal Music Classification tasks with Machine Learning algorithms to create an automated music labeler. The data in this dataset could be used to create a tool that can identify various instruments in an audio file and also classify music as either vocal or instrumental, which can help streamline the process of cataloguing and labeling new music tracks.
- This dataset could be used for training computer vision models for automatic instrument recognition from video files. By feeding the dataset into a convolutional neural network, algorithms can be developed to detect different types of instruments from video streams and differentiate between vocal or instrumental pieces.
- This dataset could be used for audio source separation research, which is the process of isolating individual audio sources from a mix of sounds within an audio clip or recording. Source separation research often relies on datasets such as this one for providing labeled data about instrumentation and pitch levels that allow researchers to develop algorithms capable of separating multiple sound sources within a single mixture signal
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
By Liz Friedman [source]
Welcome to the Opportunity Insights Economic Tracker! Our goal is to provide a comprehensive, real-time look into how COVID-19 and stabilization policies are affecting the US economy. To do this, we have compiled a wide array of data points on spending and employment, gathered from several sources.
This dataset includes daily/weekly/monthly information at the state/county/city level for several types of data: Google Mobility; Low-Income Employment and Earnings; UI Claims; Womply Merchants and Revenue; and weekly Math Learning from Zearn. Additionally, three files (GeoIDs - State/County/City) provide crosswalks between geographic areas and can be merged with other files sharing the same geographic level.
Our goal here is to enable data users around the world to follow economic conditions in the US during this tumultuous period with maximum clarity and precision. We make all our datasets freely available; if you use them, we kindly ask you to attribute our work by linking to or citing both our accompanying paper and this Economic Tracker at https://tracktherecovery.org. By doing so you are also agreeing to uphold our privacy and integrity standards, which commit us to individual and business confidentiality without compromising on independent, nonpartisan research and policy analysis!
This dataset provides US COVID-19 case and death data, as well as Google Community Mobility Reports, on the state/county level. Here is how to use this dataset:
- Understand the file structure: this dataset consists of three main groups of files: 1) US cases and deaths by state/county, 2) Google Community Mobility Reports, and 3) third-party data providing small-business openings and revenue information and unemployment-insurance claims (Low-Income Earnings & Employment, UI Claims, and Womply Merchants & Revenue).
- Select your subset: if you are interested in particular types of data (e.g., mobility or employment), select the corresponding files from each section based on your geographic area of interest (national, state, or county level) as indicated in each filename.
- Review metadata variables: become familiar with the provided variables so that you can select the ones you need for your analysis. For example, if analyzing mobility trends at the city level, look for columns such as 'Retailer_and_recreation_percent_change' or 'Transit Stations Percent Change'; if focusing on employment decline, look for pay or emp figures that align with the industries of interest, such as low-income earners (emp_{inclow}, pay_{inclow}).
- Unify date formatting across row values: convert dates into one common format so that all entries are consistent; for example, some entries may use YYYY/MM/DD notation while others use MM/DD/YY depending on their source dataset, so review column labels carefully before converting.
- Merge datasets where applicable: use the GeoID crosswalks to combine multiple sets with the same geographic coverage; for example, combine low-income earnings figures with specific counties by referencing the geo codes found in related files like GeoIDs - County. A minimal merge sketch follows this list.
- Visualise the data: once the different measures have been reviewed, generate charts to visualize your findings. This may include cleaning up raw figures, normalizing across currency formats, and mapping geospatial locations; once ready, create bar graphs, line charts, maps, and other visuals according to the aggregate output desired. Insightful representations at this stage will help inform concrete policy decisions during the outbreak recovery period. Remember to cite.
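A minimal merge sketch with pandas. The file names and the 'countyfips' join key below are assumptions based on the description above; check the headers of your copies of the files and adjust before relying on the result:

```python
import pandas as pd

# Assumed file names; adjust to the actual files in the dataset.
ui_claims = pd.read_csv("UI Claims - County - Weekly.csv")
geoids = pd.read_csv("GeoIDs - County.csv")

# 'countyfips' is the assumed join key; confirm it appears in both headers.
print(ui_claims.columns.tolist())
print(geoids.columns.tolist())

merged = ui_claims.merge(geoids, on="countyfips", how="left")
print(merged.head())
```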
- Estimating the Impact of the COVID-19 Pandemic on Small Businesses - By comparing county-level Womply revenue and employment data with pre-COVID data, policymakers can gain an understanding of the economic impact that COVID has had on local small businesses.
- Analyzing Effects of Mobility Restrictions - The Google Mobility data provides insight into geographic areas where...
By [source]
This dataset contains every word spoken by a character in the first 16 seasons of the TV show South Park. That's over 1 million words in all! Whether you're a fan of South Park or not, this is an interesting dataset to explore natural language processing and see what insights can be gleaned from such a large corpus of text
This dataset contains all of the words spoken by characters in the South Park TV show. It is divided into seasons, with each season containing a number of episodes. For each episode, there is a transcript of what was said by each character.
This dataset can be used to study the language used in the South Park TV show, as well as to study how the dialogue changes over time
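A minimal sketch for loading the full-series file and counting lines and words per character, assuming All-seasons.csv with the columns listed below:

```python
import pandas as pd

lines = pd.read_csv("All-seasons.csv")

# Who speaks the most lines across all seasons?
print(lines["Character"].value_counts().head(10))

# Rough word counts per character.
lines["words"] = lines["Line"].astype(str).str.split().str.len()
print(lines.groupby("Character")["words"].sum().sort_values(ascending=False).head(10))
```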
- Sentiment analysis of the South Park scripts
- Word clouds for each character
- Finding the most common words used in each season
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: All-seasons.csv

| Column name | Description |
|:------------|:---------------------------------------------|
| Season | The season the episode is from. (Numeric) |
| Episode | The episode number. (Numeric) |
| Character | The character who spoke the line. (String) |
| Line | The line spoken by the character. (String) |

Files: Season-1.csv, Season-10.csv, Season-11.csv, Season-12.csv, Season-13.csv, Season-14.csv, Season-15.csv, ... Each per-season file contains the same four columns (Season, Episode, Character, Line) as All-seasons.csv.
By Huggingface Hub [source]
The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is perfect for natural language processing tasks such as sentiment analysis
This Yelp reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.
To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel) and take a look at the first few rows to get a feel for what you're working with. For example, the first row looks like this: text = "So there is no way for me to plug it in here in the US unless I go by...", label = 0. A minimal classification sketch follows.
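A minimal classification sketch using scikit-learn, assuming train.csv has the text and label columns described below:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    train["text"], train["label"], test_size=0.2, random_state=42
)

# Bag-of-words features plus a linear classifier is a strong, fast baseline.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_val))
print("Validation accuracy:", accuracy_score(y_val, preds))
```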
- This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
- This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
- This dataset could be used to build a natural language processing system that generates fake Yelp reviews
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |

File: test.csv

| Column name | Description |
|:------------|:------------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
If you use this dataset in your research, please credit Huggingface Hub.
By Kuzak Dempsy [source]
This dataset contains detailed information on the risk factors for cardiovascular disease. It includes information on age, gender, height, weight, blood pressure values, cholesterol levels, glucose levels, smoking habits and alcohol consumption of over 70 thousand individuals. Additionally it outlines if the person is active or not and if he or she has any cardiovascular diseases. This dataset provides a great resource for researchers to apply modern machine learning techniques to explore the potential relations between risk factors and cardiovascular disease that can ultimately lead to improved understanding of this serious health issue and design better preventive measures
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset can be used to explore the risk factors of cardiovascular disease in adults. The aim is to understand how certain demographic factors, health behaviors and biological markers affect the development of heart disease.
To start, look through the columns of data and familiarize yourself with each one. Understand what each field means and how it relates to heart health:
- Age: age of the participant (integer)
- Gender: gender of the participant (male/female)
- Height: height measured in centimeters (integer)
- Weight: weight measured in kilograms (integer)
- Ap_hi: systolic blood pressure reading taken from the patient (integer)
- Ap_lo: diastolic blood pressure reading taken from the patient (integer)
- Cholesterol: total cholesterol level on a 0 to 5+ unit scale (integer), with each unit corresponding to roughly 20 mg/dL
- Gluc: glucose level on a 0 to 16+ unit scale (integer), with each unit corresponding to roughly 1 mmol/L
- Smoke: whether the person smokes (binary; 0 = no, 1 = yes)
- Alco: whether the person drinks alcohol (binary; 0 = no, 1 = yes)
- Active: whether the person is physically active (binary; 0 = no, 1 = yes)
- Cardio: whether the person suffers from cardiovascular disease (binary; 0 = no, 1 = yes)

Identify any trends between the different attribute values and the development of cardiovascular disease among the individuals represented in this dataset. Age, gender, weight and lifestyle practices like smoking and drinking alcohol are all key influences when analyzing this problem. You can keep refining your analysis until you find patterns that let you draw conclusions from your exploration, and you can further enrich your understanding with modeling techniques such as regression and classification models, along with more recent deep learning approaches (a minimal modeling sketch follows below). Have fun!
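As a hedged illustration of the regression/classification idea above, here is a minimal scikit-learn sketch. It assumes the heart_data.csv file and the lower-case column names shown in the table further down, with cardio as the binary outcome; check the separator, any extra id columns, and the actual column names against your copy of the file before relying on it.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the file; column names are assumed to follow the table below.
df = pd.read_csv("heart_data.csv")

# "cardio" is assumed to be the binary outcome (0 = no disease, 1 = disease).
y = df["cardio"]
X = df.drop(columns=["cardio"])

# One-hot encode any non-numeric columns (gender is described as a string).
X = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale features and fit a simple logistic regression baseline.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Inspect which risk factors carry the most weight in this simple model.
coefs = pd.Series(clf.named_steps["logisticregression"].coef_[0], index=X.columns)
print(coefs.sort_values())
```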
- Analyzing the effect of lifestyle and environmental factors on the risk of cardiovascular disease.
- Predicting the risks of different age groups based on their demographic characteristics such as gender, height, weight and smoking status.
- Detecting patterns between levels of physical activity, blood pressure and cholesterol levels with likelihood of developing cardiovascular disease among individuals
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: heart_data.csv

| Column name | Description |
|:------------|:----------------------------------------------------|
| age         | Age of the individual. (Integer)                     |
| gender      | Gender of the individual. (String)                   |
| height      | Height of the individual in centimeters. (Integer)   |
| weight      | Weight of the individual in kilograms. (Integer)     |
| ap_hi       | Systolic blood pressure reading. (Integer)           |
| ap_lo       | Diastolic blood pressure reading. (Integer)          |
| cholesterol | Cholesterol level of the individual. (Integer)       |
| gluc        | ...
By [source]
This dataset offers a fascinating insight into gender differences in fear-related personality traits and their correlation with physical strength across five university samples. It includes demographic information such as age, gender, and ethnicity, as well as physical strength measures (grip strength and chest strength) taken from undergraduate students at the University of California Santa Barbara, Oklahoma State University, University of Texas Austin, and Arizona State University. Additionally, the dataset includes self-report measures of HEXACO Emotionality, allowing the relationship between physical strength and fear-related personality traits to be explored, which is key information to consider when designing interventions for mental health issues. With this data we could examine how temperament relates to physical measures such as grip or chest strength: does having a fearful personality predispose someone to lower levels of physical power, and how does this relationship differ depending on sex? Answering these questions could provide valuable insight into how bodily strength relates to psychological traits that differ by gender. Do not miss out on this opportunity to learn more about fear-related personality traits and their association with physical strength!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
- Using the physical strength and fear-related personality trait measures, universities can identify and target students who might need extra support in an intervention to improve mental health and wellbeing.
- Exploring correlations between the physical strength measures and overall HEXACO Emotionality scores, as well as its Anxiousness, Fearfulness, Sentimentalism, and Emotional Dependence facets, to understand how gender differences in fear-related personality traits may relate to physical strength (a minimal correlation sketch follows this list).
- Comparing the distributions of simple demographic measures such as age and ethnicity across five different university samples to explore commonalities or potential differences among student populations at different universities
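For the correlation idea in the second bullet, a minimal pandas sketch is shown below. It assumes the Sample_1.csv column names listed further down (grip, chest, and the e_anx_*/e_dep_* items) and simply averages the emotionality items into a rough composite, which is an illustrative choice rather than the authors' published scoring key.

```python
import pandas as pd

# Load one of the university samples (column names assumed from the table below).
df = pd.read_csv("Sample_1.csv")

# Average the HEXACO Emotionality items into a rough composite score.
# (Illustrative only; the published scoring key may group or reverse-score items differently.)
emotionality_items = [c for c in df.columns if c.startswith(("e_anx_", "e_dep_"))]
df["emotionality"] = df[emotionality_items].mean(axis=1)

# Correlate physical strength with the composite, overall and split by gender.
print(df[["grip", "chest", "emotionality"]].corr())
print(df.groupby("female")[["grip", "chest", "emotionality"]].corr())
```

The same code can be pointed at the other sample files to compare patterns across the university samples.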
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Sample_1.csv

| Column name | Description |
|:------------|:-----------------------------------------------------------------|
| age         | Age of the participant. (Numeric)                                 |
| female      | Gender of the participant (1 = Female, 0 = Male). (Categorical)   |
| ethnicity   | Ethnicity of the participant. (Categorical)                       |
| grip        | Grip strength of the participant. (Numeric)                       |
| chest       | Chest strength of the participant. (Numeric)                      |
| e_anx_1     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_2     | Anxiety score of the participant. (Numeric)                       |
| e_anx_3     | Sentimentalism score of the participant. (Numeric)                |
| e_anx_4     | Emotional Dependence score of the participant. (Numeric)          |
| e_anx_5     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_6     | Anxiety score of the participant. (Numeric)                       |
| e_anx_7     | Sentimentalism score of the participant. (Numeric)                |
| e_anx_8     | Emotional Dependence score of the participant. (Numeric)          |
| e_anx_9     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_10    | Anxiety score of the participant. (Numeric)                       |
| e_dep_1     | Sentimentalism score of the participant. (Numeric)                |
| e_dep_2     | Emotional Dependence score of the participant. (Numeric)          |
| e_dep_3     | Fearfulness score of the participant. (Numeric)                   |
| e_dep_4     | Anxiety score of the participant. (Numeric)                       |
| e_dep_5     | Sentimentalism score of the participant. ...
By SocialGrep [source]
The "stonks" movement spawned by the GME saga is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy, yet here we are.
This dataset contains a collection of Reddit posts and comments mentioning GME in their title and body text, respectively. The data was procured using SocialGrep, and the posts and comments are labelled with their score.
It will be interesting to see how this affected stock market prices in the aftermath, and this new dataset makes that analysis possible.
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
The files contain Reddit posts and comments mentioning GME, together with their scores (and, for comments, a sentiment label). They can be used to analyze how sentiment around GME related to its stock price in the aftermath; a minimal aggregation sketch follows the use-case list below.
- To study how social media affects stock prices
- To study how Reddit affects stock prices
- To study how the sentiment of a subreddit affects stock prices
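As a rough starting point for these use cases, the sketch below computes daily average sentiment and score from the comments file. It assumes created_utc is a Unix timestamp in seconds and that the sentiment column can be coerced to a number; both assumptions should be checked against the actual files.

```python
import pandas as pd

# Load the comments file (column names taken from the table below).
comments = pd.read_csv("six-months-of-gme-on-reddit-comments.csv")

# created_utc is assumed here to be a Unix timestamp in seconds.
comments["created"] = pd.to_datetime(comments["created_utc"], unit="s")

# The sentiment column is listed as a string; coerce it to numeric where possible.
comments["sentiment"] = pd.to_numeric(comments["sentiment"], errors="coerce")
comments = comments.dropna(subset=["sentiment"]).copy()

# Aggregate to daily averages of sentiment and score, plus comment volume.
comments["day"] = comments["created"].dt.date
daily = comments.groupby("day").agg(
    mean_sentiment=("sentiment", "mean"),
    mean_score=("score", "mean"),
    n_comments=("sentiment", "size"),
)
print(daily.head())
```

The resulting daily series can then be joined with GME price data from an external source to study the relationships described in the use cases above.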
If you use this dataset in your research, please credit the original authors. Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: six-months-of-gme-on-reddit-comments.csv

| Column name    | Description |
|:---------------|:--------------------------------------------------------|
| type           | The type of post or comment. (String)                    |
| subreddit.name | The name of the subreddit. (String)                      |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)                 |
| created_utc    | The time the post or comment was created. (Timestamp)    |
| permalink      | The permalink of the post or comment. (String)           |
| body           | The body of the post or comment. (String)                |
| sentiment      | The sentiment of the post or comment. (String)           |
| score          | The score of the post or comment. (Integer)              |

File: six-months-of-gme-on-reddit-posts.csv

| Column name    | Description |
|:---------------|:--------------------------------------------------------|
| type           | The type of post or comment. (String)                    |
| subreddit.name | The name of the subreddit. (String)                      |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)                 |
| created_utc    | The time the post or comment was created. (Timestamp)    |
| permalink      | The permalink of the post or comment. (String)           |
| score          | The score of the post or comment. (Integer)              |
| domain         | The domain of the post or comment. (String)              |
| url            | The URL of the post or comment. (String)                 |
| selftext       | The selftext of the post or comment. (String)            |
| title          | The title of the post or comment. (String)               |
If you use this dataset in your research, please credit the original authors and SocialGrep.