By Huggingface Hub [source]
This dataset provides 69,000 instances of natural language processing (NLP) editing tasks to help researchers develop more effective AI text-editing models. Compiled into a convenient JSON format, this collection offers easy access so that researchers have the tools they need to create groundbreaking AI models that efficiently and effectively redefine natural language processing. This is your chance to be at the forefront of NLP technology and make history through innovative AI capabilities. So join in and unlock a world of possibilities with CoEdIT's Text Editing Dataset!
- Familiarize yourself with the format of the dataset by looking at its columns: task, src, and tgt. Each row contains a specific NLP editing task, the source text (src), and the target text (tgt) showing the expected result of that edit.
- Import the dataset file into your machine-learning environment or analysis tool of choice. Popular options include Python's Pandas library, BigQuery on Google Cloud Platform for larger datasets like this one, or a spreadsheet application such as Excel.
- Once the data is imported, start exploring. Browse a sample of rows to see how different kinds of edits transform the source text into target text that meets the given criteria, and read any documentation associated with each column to understand the context before beginning your analysis or coding.
- Prototype code that processes different types and scales of edits. For example, if understanding how punctuation affects sentence-similarity measures gives key insight into the meaning being conveyed, develop code accordingly, experimenting with common ML/NLP libraries such as NLTK. A minimal loading sketch follows this list.
- Finally, once you have tested your conceptual ideas, build efficient and effective AI-powered models using training data tailored to the task at hand, and evaluate performance on validation and test sets before moving to production.
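A minimal loading sketch with pandas, assuming the training split is available as the train.csv described further below (the usage notes call the columns task, src, and tgt, while the file listing capitalizes the first as Task, so the sketch normalizes the names):

```python
import pandas as pd

# Load the training split and normalize column-name casing just in case.
df = pd.read_csv("train.csv")
df.columns = [c.lower() for c in df.columns]

print(df.shape)
print(df["task"].value_counts())        # which editing tasks are present
print(df[["task", "src", "tgt"]].head())
```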
- Automated Grammar Checking Solutions: This dataset can be used to train machine learning models to detect grammatical errors and suggest proper corrections.
- Text Summarization: Using this dataset, researchers can create AI-powered summarization algorithms that summarize long-form passages into shorter summaries while preserving accuracy and readability
- Natural Language Generation: This dataset could be used to develop AI solutions that generate accurately formatted natural language sentences when given a prompt or some other form of input
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------------|
| Task | This column describes the task that the dataset is intended to be used for. (String) |
| src | This column contains the source text input. (String) |
| tgt | This column contains the target text output. (String) |

File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------------|
| Task | This column describes the task that the dataset is intended to be used for. (String) |
| src | This column contains the source text input. (String) |
| tgt | This column contains the target text output. (String) |
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames to preserve users' anonymity and to prevent targeted harassment
To use this dataset, you will need a spreadsheet application or text editor capable of opening CSV files (for example LibreOffice Calc or a plain-text editor), plus a web browser such as Google Chrome or Mozilla Firefox if you want to follow the permalinks.
Once you have the necessary software installed, open the Reddit Dataset folder and open the-reddit-dataset-dataset-posts.csv in your preferred application.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in the datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on post titles to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes; a minimal sketch of this kind of analysis follows.
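A minimal analysis sketch with pandas, using the column names from the posts file listing below (created_utc is assumed to be a Unix epoch; adjust if your copy stores ISO timestamps):

```python
import pandas as pd

# Load the posts file (columns per the file listing below).
posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")

# Overall average score and per-sentiment averages.
print("Average score:", posts["score"].mean())
print(posts.groupby("sentiment")["score"].agg(["mean", "count"]))

# Posting volume per month.
created = pd.to_datetime(posts["created_utc"], unit="s")
print(created.dt.to_period("M").value_counts().sort_index().tail(12))
```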
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiment of posts and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| body | The body of the post. (String) |
| sentiment | The sentiment of the post. (String) |
| score | The score of the post. (Integer) |

File: the-reddit-dataset-dataset-posts.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| score | The score of the post. (Integer) |
| domain | The domain of the post. (String) |
| url | The URL of the post. (String) |
| selftext | The self-text of the post. (String) |
| title | The title of the post. (String) |
If you use this dataset in your research, please credit SocialGrep.
Weekly updated dataset with the latest version of the Numerai tournament data. The dataset contains a directory named after the latest data version, currently V5.0. The data are downloaded weekly by the public Kaggle notebook numerai data whenever new data become available (at the opening of the Saturday round). Whenever that notebook's output changes, this dataset is automatically updated, so you can add it to your notebooks as a data source (or as the output of the numerai data notebook) without downloading the files yourself.
Older versions of data are available elsewhere: * V4 and V4.1 - dataset and producing notebook * V4.2 Rain - dataset and producing notebook * V4.3 Midnight - dataset and producing notebook
The text file current_round.txt contains the number of the tournament round in which the data were last successfully downloaded.
In addition to all data files provided by Numerai, the downloading notebook creates four partitions of non-overlapping eras from the training and validation data. These are stored as f"train_no{split}.parquet" and f"validation_no{split}.parquet" files. Since Round 864 the polars library is used to produce the downsampled files. Because polars does not use an index, the saved files store the index id as an ordinary column. If you need the same index as in the original files, add the following check to your code right after df = pandas.read_parquet(filename):
# Restore the "id" index if the parquet file stored it as an ordinary column.
if "id" not in df.index.names:
    df.set_index("id", inplace=True)
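For convenience, the check can be wrapped in a small helper. The directory name and split index below are assumptions based on the description above (version V5.0, splits numbered from 0):

```python
import pandas as pd

def load_split(filename: str) -> pd.DataFrame:
    """Load a downsampled parquet split and restore the 'id' index if needed."""
    df = pd.read_parquet(filename)
    if "id" not in df.index.names:
        df.set_index("id", inplace=True)
    return df

# Example: first training partition from the latest data version (path assumed).
train_0 = load_split("v5.0/train_no0.parquet")
print(train_0.shape)
```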
This dataset is important as it can help users find good-quality videos more easily. The data was collected using the YouTube API and includes a total of _ videos.
Columns: Channel title, view count, like count, comment count, definition, caption, subscribers, total views, average polarity score, label
To use this dataset, you will need the following: a YouTube API key and a text editor (e.g. Notepad++, Sublime Text).
Once you have collected these items, you can begin using the dataset. Here is a step-by-step guide (a minimal analysis sketch follows these steps):
- Navigate to the folder where you saved the dataset.
- Right-click on the file and select Open with > your text editor.
- Copy your YouTube API key and paste it in place of Your_API_Key in line 4 of the code.
- Save the file and close your text editor.
- Navigate to the folder in your terminal/command prompt and type jupyter notebook. This will open a Jupyter Notebook in your browser window.
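Once the notebook is open, a minimal exploration sketch like the following could be used to rank channels by engagement. The column names are taken from the dataframeclean.csv listing further below and should be treated as assumptions to verify against your copy of the file:

```python
import pandas as pd

# Column names follow the dataframeclean.csv listing below; adjust if they differ.
df = pd.read_csv("dataframeclean.csv")

# Simple engagement ratios as a rough proxy for video quality.
df["like_ratio"] = df["likeCount"] / df["viewCount"].clip(lower=1)
df["comment_ratio"] = df["commentCount"] / df["viewCount"].clip(lower=1)

# Channels ranked by average like ratio.
print(
    df.groupby("channelTitle")[["like_ratio", "comment_ratio"]]
      .mean()
      .sort_values("like_ratio", ascending=False)
      .head(10)
)
```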
This dataset can be used for a number of different things, including:
- Finding good-quality videos on YouTube
- Determining which videos are more likely to be reputable
- Helping people find videos they will enjoy
The data for this dataset was collected using the YouTube API and includes a total of _ videos.
See the dataset description for more information.
File: dataframeclean.csv

| Column name | Description |
|:-------------------|:------------|
| **** | |
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| pushblishYear | |
| durationSecs | |
| tagCount | |
| title length | |
| description length | |

File: ytdataframe.csv

| Column name | Description |
|:-------------------|:-------------------------------------------------------------|
| **** | |
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| favouriteCount | The number of times the video has been favorited. (Integer) |
| duration | The length of the video in seconds. (Integer) |

File: ytdataframe2.csv

| Column name | Description |
|:-------------------|:-------------------------------------------------------------|
| **** | |
| channelTitle | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| viewCount | |
| ... | |
By Environmental Data [source]
Do you want to know how rising temperatures are changing the contiguous United States? The Washington Post has used National Oceanic and Atmospheric Administration's Climate Divisional Database (nClimDiv) and Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nClimGrid) data sets to help analyze warming temperatures in all of the Lower 48 states from 1895-2019. To provide this analysis, we calculated annual mean temperature trends in each state and county in the Lower 48 states. Our results can be found within several datasets now available on this repository.
We are offering: annual average temperatures for counties and states, temperature change estimates for each of the Lower 48 states, temperature change estimates for counties in the contiguous U.S., county temperature change data joined to a shapefile in GeoJSON format, and gridded temperature change data for the contiguous U.S. in GeoTIFF format, all contained within this dataset! We invite those curious about climate change to explore these data sets, which underpin our analysis across multiple stories published by The Washington Post, such as Extreme climate change has arrived in America; Fires, floods and free parking: California's unending fight against climate change; In fast-warming Minnesota, scientists are trying to plant the forests of the future; This giant climate hot spot is robbing the West of its water; and more!
By accessing our dataset, which contains columns such as the FIPS code, year (covering 1895-2019), seasonal temperatures (fall/spring/summer/winter), the maximum-warming season, and total yearly temperature, you can become an active citizen scientist! If you publish a story or graphic based on this data set, please credit The Washington Post with a link back to this repository, and send us an email so that we can track its usage as well (2cdatawashpost.com).
The main files provided by this dataset are climdiv_state_year, climdiv_county_year, model_state, model_county, climdiv_national_year, and model_county.geojson. Each file captures climate change across different geographies of the United States over time spans beginning in 1895.
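A minimal sketch for estimating per-state warming trends with pandas and NumPy, using the climdiv_state_year.csv columns listed further below (fips, year, tempc):

```python
import pandas as pd
import numpy as np

# Columns per the climdiv_state_year.csv listing below: fips, year, tempc.
state = pd.read_csv("climdiv_state_year.csv")

def warming_slope(g: pd.DataFrame) -> float:
    """Least-squares slope of temperature against year (degrees per year)."""
    slope, _intercept = np.polyfit(g["year"], g["tempc"], 1)
    return slope

trends = (
    state.groupby("fips")[["year", "tempc"]]
    .apply(warming_slope)
    .sort_values(ascending=False)
)
print(trends.head(10))   # fastest-warming state FIPS codes
```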
- Investigating and mapping the temperatures for all US states over the past 120 years, to observe long-term changes in temperature patterns.
- Examining regional biases in warming trends across different US counties and states to help inform resource allocation decisions for climate change mitigation and adaptation initiatives.
- Utilizing the ClimDiv National Dataset to understand continental-level average annual temperature changes, allowing comparison of global average temperatures with US averages over a long period of time
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: climdiv_state_year.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| fips | Federal Information Processing Standard code for each county. (Integer) |
| year | Year of the temperature data. (Integer) |
| tempc | Temperature change from the previous year. (Float) |

File: climdiv_county_year.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| fips | Federal Information Processing Standard code for each county. (Integer) |
| year | Year of the temperature data. (Integer) |
| tempc | Temperature change from the previous year. (Float) |
File: model_state.csv

| Column name | Description |
|:------------|:------------|
| ... | |
By [source]
This dataset allows readers to unlock hidden insights into contemporary literature and the books that people are choosing to purchase. It provides comprehensive and powerful data related to a web books retailer, books.toscrape.com, featuring 12 columns of crucial book metadata gathered through web scraping methods in November 2020. Researching publications through this information provides a great sense of insight and understanding into the current reading climate: uncovering emerging trends in what people are buying, reading, rating, and loving worldwide. With this dataset at your disposal you can explore book popularity from a commercial standpoint as well as a creative one; examining publishing preferences from authors' points of view across reviews and genres alike. Dive into discovering the secrets behind book selection habits by delving into topics ranging from rating systems for certain works to pricing analysis for publishers- all fuelled by this carefully organised streamline of data at play here today!
To get started analyzing this dataset with Kaggle notebooks or other tools:
- Open your tool of choice (a Kaggle notebook or anything else that can read CSV files).
- Import the dataset.csv file into your chosen program.
- Explore each column individually to understand what type of book metadata it holds: title, image URL, rating, number of reviews, description, and more.
- Once familiar with each column, explore correlations between them to deepen your understanding of trends among different types of books, broken down by category.
- Lastly, use third-party packages available in your chosen programming language (e.g., Pandas) to continue with deeper analysis; a minimal loading sketch follows this list.
By following these steps you are ready to start exploring literature insights within this book metadata that may otherwise have gone undiscovered!
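A minimal loading sketch with pandas. Per the column listing further below, the CSV header appears to contain values from the first book (a title, prices, and so on) rather than descriptive names, so the column names used here are assumptions taken from that listing; verify them against your copy of the file before relying on the results:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

# Inspect what the header row actually looks like before doing anything else.
print(df.columns.tolist())
print(df.head())

# Example: clean a price-like column (names taken from the listing below) and
# look at the average price per category. Adjust the names to your copy.
price_col, category_col = "£13.12", "Academic"
prices = df[price_col].astype(str).str.replace("£", "", regex=False).astype(float)
print(prices.groupby(df[category_col]).mean().sort_values(ascending=False).head())
```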
- Generating recommendations of books based on popularity, price point, and/or rating.
- Tracking the success of certain authors/publishers in the long term and understanding their audience preferences.
- Analysing which types of books consumers prefer (genre, age group targeting) over time to provide useful data to new authors to increase their chances of success
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dataset.csv

| Column name | Description |
|:-------------------------------------|:----------------------------------------------------|
| Logan Kade (Fallen Crest High #5.5) | Title of the book. (String) |
| https | Image URL of the book. (String) |
| Two | Rating of the book. (Integer) |
| Academic | Description Category of the book. (String) |
| 7093cf549cd2e7de | Universal Product Code (UPC) of the book. (String) |
| Books | Product Type of the book. (String) |
| £13.12 | Price Excluding Tax of the book. (Float) |
| £13.12.1 | Price Including Tax of the book. (Float) |
| £0.00 | Tax Amount of the book. (Float) |
| In stock (5 available) | Availability of the book. (String) |
If you use this dataset in your research, please credit the original authors.
Here is the dataset for classifying different classes of traffic signs. There are around 58 classes and each class has around 120 images. The labels.csv file has the respective description of each traffic sign class; you can change the assignment of these class IDs to descriptions. A basic CNN model is enough to get decent validation accuracy, and there are around 2,000 files for testing.
You can view the notebook named official in the code section to train and test a basic CNN model; a minimal Keras sketch is also included below.
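A minimal Keras sketch of such a baseline CNN. It assumes the training images are arranged in one sub-folder per class and are resized to 32x32; the directory name and image size are assumptions to adapt to your copy of the dataset:

```python
import tensorflow as tf

NUM_CLASSES = 58          # per the dataset description
IMG_SIZE = (32, 32)       # assumed resize target; adjust to your preprocessing

# Assumes one sub-folder per class under this (assumed) directory.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "traffic_Data/DATA", image_size=IMG_SIZE, batch_size=32,
    validation_split=0.2, subset="training", seed=42,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "traffic_Data/DATA", image_size=IMG_SIZE, batch_size=32,
    validation_split=0.2, subset="validation", seed=42,
)

# A small convolutional baseline: two conv/pool blocks and a dense head.
model = tf.keras.Sequential([
    tf.keras.Input(shape=IMG_SIZE + (3,)),
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```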
Please upvote the notebook and dataset if you like this.
By [source]
This dataset contains data on the digital usage habits and ICT skills of Finnish basic education teachers from 2017-2019. It includes valuable background information such as age, gender, postal code of place of employment, teacher types, and urbanization level. Furthermore, this dataset also includes variables that measure self-efficacy in digital skills; perceived adequacy of in-service training in digital skills; frequency with which these teachers use digital technologies; and a summative measure for identifying, retrieving, processing, and sharing information.
With this data set researchers can study the effects of existing programs aimed at enhancing teachers' technology usage in Finnish basic education, as well as explore the differences between demographic groups in ICT knowledge and activity levels. Connections between age groups' or geographical areas' digital literacy can be revealed by analyzing trends in the data presented here. Additionally, researchers will be able to gain insight into how urbanization affects ICT skill levels among teachers, and look into whether adequate training is being provided for keeping up with changing technologies in educational environments.
This comprehensive dataset is an incredibly valuable resource for those studying the role that technology has to play in our current educational systems
This dataset provides valuable insights into Finnish basic education teachers’ ICT skills and how they use digital technology in the classroom. Here are some tips to help you get the most out of this dataset:
- Analyze the demographic characteristics of teachers in Finland and identify any patterns or trends that may exist between teacher characteristics and their self-efficacy, training adequacy, and digital activity levels.
- Examine how age, gender, urbanization level (or lack thereof), teacher type, and information skills affect perceived digital competency levels among Finnish basic education teachers.
- Compare different types of training programs for Finnish basic education teachers to discern which are most effective at improving their ICT skills as well as their adoption of digital technologies in the classroom.
- Utilize this data to understand Finland’s approach to digital literacy education across geographic regions, with a particular focus on rural areas versus more urbanized zones where access to technology varies significantly.
- Finally, research how digital usage habits among Finnish basic educators may be changing over time by using data from multiple years within this dataset as a starting point for further investigation into trends in self-efficacy ratings or frequency/type of usage by year or season. A minimal exploration sketch follows this list.
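A minimal exploration sketch with pandas, using the column names from the Information_skills_teachers.csv listing further below:

```python
import pandas as pd

# Column names per the Information_skills_teachers.csv listing below.
teachers = pd.read_csv("Information_skills_teachers.csv")

# How does reported self-efficacy vary with urbanization level?
print(pd.crosstab(teachers["Urbanization_level"], teachers["Self_efficacy"],
                  normalize="index").round(2))

# Age distribution per perceived adequacy of in-service training.
print(teachers.groupby("Inservice_training")["Age"].describe())
```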
- Analyzing differences in digital skills, self-efficacy, and usage habits between different age groups and genders.
- Examining the relationship between urbanization level and teachers’ digital activity.
- Investigating how information technology skills can be used to enhance digital literacy in Finnish classrooms
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Information_skills_teachers.csv

| Column name | Description |
|:-------------------|:-----------------------------------------------------------------------------|
| Urbanization_level | Level of urbanization of the teacher's place of employment. (Categorical) |
| Age | Age of teacher. (Numerical) |
| Self_efficacy | Self-efficacy in digital skills. (Categorical) |
| Inservice_training | Perceived adequacy of in-service training in digital skills. (Categorical) |
| ... | |
By Huggingface Hub [source]
This Synthia-v1.3 dataset provides insight into the complexities of human-machine communication through its collection of dialogue interactions between humans and machines. It details how conversations develop between the two, including behavioural changes in both humans and machines towards one another over time. With information provided on user instructions to machines, the system, machine responses, and other related data points, this dataset offers a detailed overview of machine learning concepts, examining how systems use dialogue to interact with people in various scenarios. This can offer valuable insight into how predictive intelligence is applied by these systems in conversational settings, better informing developers seeking to build their own human-machine interfaces for effective two-way communication. Looking at this data set as a whole creates an understanding of the way connections form between humans and machines, providing a deeper appreciation of the ongoing challenges faced when working on projects with these technological components at play.
The dataset consists of a collection of dialogue interactions between humans and machines, providing insight into human-machine communication. It includes information about the system being used, instructions given by humans to machines and responses from machines.
To start using this dataset:
- Download the CSV file containing all of the dialogue interactions from the Kaggle datasets page.
- Open your favourite spreadsheet software, such as Excel or Google Sheets, and load the CSV file.
- Look at each of the columns to familiarize yourself with what they contain: the 'system' column describes the system used for the role play between human and machine; the 'instruction' column contains the instructions given by humans to machines; the 'response' column contains the machines' responses to those instructions.
- Start exploring how conversations progress between humans and machines over time by examining these columns separately or together as required. You can also filter for specific conditions, such as conversations driven entirely by particular systems or involving certain instruction types. In addition, you can conduct various kinds of analysis, such as descriptive statistics or correlation analysis; a minimal filtering sketch follows this list. With so many possibilities for exploration, you are sure to find something interesting!
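A minimal filtering sketch with pandas, using the column names from the train.csv listing further below; the search term is only an illustrative assumption:

```python
import pandas as pd

# Columns per the train.csv listing below: system, instruction, response.
df = pd.read_csv("train.csv")

# Which system prompts appear most often?
print(df["system"].value_counts().head())

# Filter conversations whose instruction mentions a particular topic (example term).
code_related = df[df["instruction"].str.contains("python", case=False, na=False)]
print(len(code_related), "instructions mention 'python'")
print(code_related[["instruction", "response"]].head())
```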
- Utilizing the dataset to understand how various types of instruction styles can influence conversation order and flow between humans and machines.
- Using the data to predict potential responses in a given dialogue interaction from varying sources, such as robots or virtual assistants.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-----------------------------------------------------------------|
| system | The type of system used in the dialogue interaction. (String) |
| instruction | The instruction given by the human to the machine. (String) |
| response | The response given by the machine to the human. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
By [source]
This dataset contains a treasure-trove of information on over 55 million open source Java files, providing technical debt-related insights that can be used to inform a range of research and analytical activities. Every file captured in the dataset is assigned an MD5-hash to ensure unique identification, along with key metrics including its technical debt probability, fan-in/fan-out levels, total methods & variables, lines of code & comment lines, and the number of occurrences recorded.
These data points can each provide important guidance into the magnitude and scope of technical debt in open source Java software development projects. Researchers can analyse correlations between their technical debt probability and levels of fan-in/fan-out as well as variables such as methods created & number of lines written. Meanwhile analysts are enabled to identify files with high impacts on code quality through comparing their joint location in both technical debt probability rankings and highest occurrence rankings.
Utilizing this comprehensive dataset opens up opportunities for a wide range of investigations into the complex relationships between software development practices and code quality. It presents an invaluable resource for anyone looking to gain key insights into this subject, turning questions into answers via exploration!
How to use this dataset:
The dataset contains several columns with different pieces of information, including file_md5 (a unique identifier for each file), td_probability (the probability that the file contains technical debt), fanin (the number of incoming dependencies for the file), fanout (the number of outgoing dependencies for the file), total methods and variables, and total lines of code and comment lines. Researchers or analysts may perform statistical analysis on these parameters to get an overall idea of the impact that these values have on code quality. Additionally, they may find correlations between values such as the fan-in/fan-out ratio and sums or averages of the methods/variables used in a particular set of files. Finally, they can look at the occurrences column, which records how many times a particular MD5 hash has been used in open source repositories; this could help identify particularly well-received files that have been widely reused across multiple platforms.
By examining these columns together you will be able to gain insight into trends related to technical debt in Open Source Java programs as well as identify key areas where there is potential danger/challenges associated with implementation within your own projects. With enough data manipulation you may even make predictions regarding future implementation based on past experiences!
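A minimal correlation sketch with pandas. The column names are taken from the description above and the listing below; since the file covers tens of millions of rows, only the columns needed for the check are loaded:

```python
import pandas as pd

# Column names per the description above; adjust if your copy differs.
cols = ["td_probability", "fanin", "fanout"]
td = pd.read_csv("TD_of_55M_files.csv", usecols=cols)

# How strongly does technical-debt probability co-vary with coupling metrics?
print(td.corr(numeric_only=True))

# Files with both high debt probability and very high fan-in are likely hot spots.
hot_spots = td[(td["td_probability"] > 0.9) & (td["fanin"] > td["fanin"].quantile(0.99))]
print(len(hot_spots), "potential hot spots")
```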
- Correlating technical debt probability and lines of code or variables to determine how additional code complexity impacts the magnitude of technical debt.
- Identifying files with a high probability of technical debt which have been used in multiple projects, so that those files may be improved to help future projects.
- Analyzing the average fan-in and fan-out for different programming paradigms, such as MVC, to determine if any design patterns produce higher degrees of technical debt than other paradigms or architectures
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: TD_of_55M_files.csv

| Column name | Description |
|:---------------|:-----------------------------------------------------------------------------------------------------|
| file_md5 | A unique identifier for each file that can also be used to track them across repositories or other sources. (String) |
| td_probability | The probability that the file contains technical debt. |
| ... | |
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
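Since the data ships as a CSV, a pandas equivalent of that query may be more convenient than SQL. The qtype value 'Treatment' is taken from the example above and may differ from the values actually present in the data:

```python
import pandas as pd

# Columns per the train.csv listing below: qtype, Question, Answer.
med = pd.read_csv("train.csv")

# pandas equivalent of:
#   SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
answers = med.loc[
    (med["qtype"] == "Treatment")
    & med["Question"].str.contains("pain", case=False, na=False),
    "Answer",
]
print(len(answers), "matching answers")
print(answers.head())
```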
Once you have obtained new insights about healthcare from the answers provided in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis, so look out for notifications if that happens.
Finally, once you are making an impact with your use case(s), don't forget proper citation etiquette: give credit where credit is due!
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:---------------------------------------------------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
By [source]
The FakeCovid dataset is an unparalleled compilation of 7,623 fact-checked news articles related to COVID-19. Obtained from 92 fact-checking websites located in 105 countries, this comprehensive collection covers a wide range of sources and languages, including locations across Africa, Europe, Asia, the Americas, and Oceania. With data gathered from references on Poynter and Snopes, this unique dataset is an invaluable resource for researching the accuracy of global news related to the pandemic. It offers insight into the international nature of COVID information, with columns covering the countries involved; categories such as coronavirus health updates or political interference during the pandemic; URLs for referenced articles; the verifiers employed by each website; article classes ranging from true to false or mixed evaluations; publication dates; article sources annotated with credibility verification; and article text with language standardization. This one-of-a-kind dataset serves as an essential tool for understanding global information flow around COVID-19 while also offering transparency into whose interests guide it.
The FakeCovid dataset is a multilingual cross-domain collection of 7623 fact-checked news articles related to COVID-19. It is collected from 92 fact-checking websites and covers a wide range of sources and countries, including locations in Africa, Asia, Europe, The Americas, and Oceania. This dataset can be used for research related to understanding the truth and accuracy of news sources related to COVID-19 in different countries and languages.
To use this dataset effectively, you will need basic knowledge of data-science principles and tooling such as pandas, NumPy, or scikit-learn. The data is in CSV (comma-separated values) format, which can be read by most spreadsheet applications or a text editor like Notepad++. Here are some steps to get started:
- Access the FakeCovid Fact Checked News Dataset from Kaggle: https://www.kaggle.com/c/fakecovidfactcheckednewsdataset/data
- Download the provided CSV file containing all fact-checked news articles and place it in your desired folder.
- Load the CSV file into your preferred environment, such as a Jupyter Notebook or RStudio.
- Explore the dataset using built-in functions from data-science libraries such as pandas and matplotlib: find meaningful information through statistical analysis and/or create visualizations (a minimal exploration sketch follows below).
- Modify parameters within the CSV file if required and save your changes.
- Share your creative projects through the Gitter chatroom #fakecovidauthors.
- Publish any interesting discoveries you make in open-source repositories like GitHub.
- Engage with the Hangouts group #FakeCoviDFactCheckersClub.
- Show off fun graphics via the Twitter hashtag #FakeCovidiauthors.
- Reach out with further questions via email (contactfakecovidadatateam).
- Stay connected by joining the mailing list #FakeCoviDAuthorsGroup.
We hope this guide helps you better understand how to use our FakeCoviD Fact Checked News Dataset for generating meaningful insights relating to COVID-19 news articles worldwide!
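A minimal exploration sketch with pandas. The file name and the column names used here (country, class, lang) are assumptions based on the description above; the sketch prints the real header first so you can adjust accordingly:

```python
import pandas as pd

# Assumed file name; use the CSV you actually downloaded from Kaggle.
df = pd.read_csv("FakeCovid_July2020.csv")

print(df.columns.tolist())          # confirm the real column names first

# Assumed column names based on the description above; skipped if absent.
for col in ("country", "class", "lang"):
    if col in df.columns:
        print(f"\nTop values for '{col}':")
        print(df[col].value_counts().head(10))
```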
- Developing an automated algorithm to detect fake news related to COVID-19 by leveraging the fact-checking flags and other results included in this dataset for machine learning and natural language processing tasks.
- Training a sentiment analysis model on the data to categorize articles by sentiment, which can be used for further investigation into why certain news topics or countries show certain outcomes, motivations, or behaviors due to their content relatedness or author bias (if any).
- Using unsupervised clustering techniques, this dataset could be used as a tool for identifying discrepancies between news circulated among different populations in different countries (languages and regions), so that publicists can focus on providing factual information rather than spreading false rumors or misinformation about the pandemic.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Normally you need to download another notebook's results and then re-upload them if you want to use them within your own notebook. So I created this dataset for anyone who wants to use notebook results directly, without the download/upload step. Please upvote if it helps you.
This dataset contains 5 results used as input for a hybrid approach in these notebooks: * https://www.kaggle.com/titericz/h-m-ensembling-how-to/notebook * https://www.kaggle.com/code/atulverma/h-m-ensembling-with-lstm
If you want to use those notebooks but can't access a private dataset, please add my dataset to your notebook, then change the file paths.
It has 5 files (a minimal loading sketch follows the list):
* submissio_byfone_chris.csv: Submission result from: https://www.kaggle.com/lichtlab/0-0226-byfone-chris-combination-approach
* submission_exponential_decay.csv: Submission result from: https://www.kaggle.com/tarique7/hnm-exponential-decay-with-alternate-items/notebook
* submission_trending.csv: Submission result from: https://www.kaggle.com/lunapandachan/h-m-trending-products-weekly-add-test/notebook
* submission_sequential_model.csv: Submission result from: https://www.kaggle.com/code/astrung/sequential-model-fixed-missing-last-item/notebook
* submission_sequential_with_item_feature.csv: Submission result from: https://www.kaggle.com/code/astrung/lstm-model-with-item-infor-fix-missing-last-item/notebook
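A minimal loading sketch, assuming the dataset is attached to a Kaggle notebook and mounted under an input directory; the folder name below depends on the dataset slug and is an assumption to adjust:

```python
import pandas as pd
from pathlib import Path

# Adjust this to the folder Kaggle mounts for the attached dataset.
base = Path("/kaggle/input/hm-notebook-results")

files = [
    "submissio_byfone_chris.csv",
    "submission_exponential_decay.csv",
    "submission_trending.csv",
    "submission_sequential_model.csv",
    "submission_sequential_with_item_feature.csv",
]

# Load each submission file and report its shape.
submissions = {name: pd.read_csv(base / name) for name in files}
for name, df in submissions.items():
    print(name, df.shape)
```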
By [source]
The ACMUS YouTube Music Set is an annotated collection of music from YouTube videos, designed to support the exploration of cutting-edge computational methods for two key tasks: instrumental format identification and vocal music classification. Encompassing a wide range of genres and eras, this multi-dimensional dataset contains information such as File Name, Title, Genre, Composer or Artist?, Sampling Rate, Channels, Bit Depth, Duration (sec), Original File (if applicable), the Collection from which it was taken, Observations made about the audio file (if any), Number of Instruments present, presence or absence of Guitar/Bandola/Tiple/Bass/Percussion, Tempo, and Language. Additionally, this dataset goes one step further by including vocal classification based on the presence or absence of a Female Voice or Male Voice. This is a great resource for anyone exploring artificial intelligence techniques related to music recognition and vocal classification.
- Review the columns of information included in the dataset: File name, Title, Genre, Composer or artist?, Sampling rate, Channels, Bit depth, Duration (sec), Original file, Collection, Observations, Nr. of instruments, Guitar, Bandola, Tiple, Bass, Percussion, Female voice, Male voice, Tempo, Language, and Artist/Performer.
- Start by exploring the audio file properties first: sampling rate, channels, bit depth, and duration, together with the collection each file was taken from.
- Make sure you have a clear understanding of each column before you proceed, including the annotation features such as genre, number of instruments, the presence of individual instruments, and the presence of a female or male voice.
- Establish relationships between different data points by using visualization tools such as graphs, tables, and scatter plots: visualize related audio-file properties like genre, channel composition, artist names, original files, collections, and observations, identifying cover versions, instrumental versions, and so on.
- Update your research regularly with new findings by revisiting your visualizations, comparing features between different formats, and running clustering algorithms to better group music files. A minimal exploration sketch follows this list.
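A minimal exploration sketch with pandas, assuming the annotations ship as a CSV with the columns named above; the file name is an assumption, and the sketch prints the real header and skips any columns that are missing:

```python
import pandas as pd

# Assumed file name; replace with the actual annotation file in the dataset.
meta = pd.read_csv("acmus_annotations.csv")
print(meta.columns.tolist())

# Distribution of genres and of vocal annotations, using the column names
# quoted in the steps above (adjust if your header differs).
if "Genre" in meta.columns:
    print(meta["Genre"].value_counts().head(10))
for col in ("Female Voice", "Male Voice"):
    if col in meta.columns:
        print(f"\n{col}:")
        print(meta[col].value_counts(dropna=False))
```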
- Using the Instrumental Format Recognition and Vocal Music Classification tasks with Machine Learning algorithms to create an automated music labeler. The data in this dataset could be used to create a tool that can identify various instruments in an audio file and also classify music as either vocal or instrumental, which can help streamline the process of cataloguing and labeling new music tracks.
- This dataset could be used for training computer vision models for automatic instrument recognition from video files. By feeding the dataset into a convolutional neural network, algorithms can be developed to detect different types of instruments from video streams and differentiate between vocal or instrumental pieces.
- This dataset could be used for audio source separation research, which is the process of isolating individual audio sources from a mix of sounds within an audio clip or recording. Source separation research often relies on datasets such as this one for providing labeled data about instrumentation and pitch levels that allow researchers to develop algorithms capable of separating multiple sound sources within a single mixture signal
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
By Liz Friedman [source]
Welcome to the Opportunity Insights Economic Tracker! Our goal is to provide a comprehensive, real-time look into how COVID-19 and stabilization policies are affecting the US economy. To do this, we have compiled a wide array of data points on spending and employment, gathered from several sources.
This dataset includes daily/weekly/monthly information at the state/county/city level for several types of data: Google Mobility; Low-Income Employment and Earnings; UI Claims; Womply Merchants and Revenue; and weekly Math Learning from Zearn. Additionally, three files (GeoIDs - State/County/City) provide crosswalks between geographic areas and can be merged with other files sharing the same geographic level.
Our goal here is to enable data users around the world to follow economic conditions in the US during this tumultuous period with maximum clarity and precision. We make all our datasets freely available; if you use them, we kindly ask you to attribute our work by linking to or citing both our accompanying paper and this Economic Tracker at https://tracktherecovery.org. By doing so you are also agreeing to uphold our privacy and integrity standards, which commit us to individual and business confidentiality without compromising on independent, nonpartisan research and policy analysis!
This dataset provides US COVID-19 case and death data, as well as Google Community Mobility Reports, on the state/county level. Here is how to use this dataset:
- Understand the file structure: this dataset consists of three main groups of files: 1) US cases and deaths by state/county, 2) Google Community Mobility Reports, and 3) third-party data providing small-business openings and revenue information and unemployment-insurance claims (Low-Income Earnings & Employment, UI Claims, and Womply Merchants & Revenue).
- Select your subset: if you are interested in particular types of data (e.g., mobility or employment), select the corresponding files from each section based on your geographic area of interest (national, state, or county level) as indicated in each filename.
- Review metadata variables: become familiar with the provided variables so that you can select the ones you need for your analysis. For example, if analyzing mobility trends at the city level, look for columns such as 'Retailer_and_recreation_percent_change' or 'Transit Stations Percent Change'; if focusing on employment decline, look for pay or emp figures that align with the industries of interest, such as low-income earners (emp_{inclow}, pay_{inclow}).
- Unify date formatting across row values: convert dates into one common format so that all entries are consistent; for example, some entries may use YYYY/MM/DD notation while others use MM/DD/YY depending on their source dataset, so review column labels carefully before converting.
- Merge datasets where applicable: use the GeoID crosswalks to combine multiple sets with the same geographic coverage; for example, combine low-income earnings figures with specific counties by referencing the geo codes found in related files like GeoIDs - County. A minimal merge sketch follows this list.
- Visualise the data: once the different measures have been reviewed, generate charts to visualize your findings. This may include cleaning up raw figures, normalizing across currency formats, and mapping geospatial locations; once ready, create bar graphs, line charts, maps, and other visuals according to the aggregate output desired. Insightful representations at this stage will help inform concrete policy decisions during the outbreak recovery period. Remember to cite.
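A minimal merge sketch with pandas. The file names and the 'countyfips' join key below are assumptions based on the description above; check the headers of your copies of the files and adjust before relying on the result:

```python
import pandas as pd

# Assumed file names; adjust to the actual files in the dataset.
ui_claims = pd.read_csv("UI Claims - County - Weekly.csv")
geoids = pd.read_csv("GeoIDs - County.csv")

# 'countyfips' is the assumed join key; confirm it appears in both headers.
print(ui_claims.columns.tolist())
print(geoids.columns.tolist())

merged = ui_claims.merge(geoids, on="countyfips", how="left")
print(merged.head())
```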
- Estimating the Impact of the COVID-19 Pandemic on Small Businesses - By comparing county-level Womply revenue and employment data with pre-COVID data, policymakers can gain an understanding of the economic impact that COVID has had on local small businesses.
- Analyzing Effects of Mobility Restrictions - The Google Mobility data provides insight into geographic areas where...
By [source]
This dataset contains every word spoken by a character in the first 16 seasons of the TV show South Park. That's over 1 million words in all! Whether you're a fan of South Park or not, this is an interesting dataset to explore natural language processing and see what insights can be gleaned from such a large corpus of text
This dataset contains all of the words spoken by characters in the South Park TV show. It is divided into seasons, with each season containing a number of episodes. For each episode, there is a transcript of what was said by each character.
This dataset can be used to study the language used in the South Park TV show, as well as to study how the dialogue changes over time
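A minimal sketch for loading the full-series file and counting lines and words per character, assuming All-seasons.csv with the columns listed below:

```python
import pandas as pd

lines = pd.read_csv("All-seasons.csv")

# Who speaks the most lines across all seasons?
print(lines["Character"].value_counts().head(10))

# Rough word counts per character.
lines["words"] = lines["Line"].astype(str).str.split().str.len()
print(lines.groupby("Character")["words"].sum().sort_values(ascending=False).head(10))
```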
- Sentiment analysis of the South Park scripts
- Word clouds for each character
- Finding the most common words used in each season
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: All-seasons.csv

| Column name | Description |
|:------------|:---------------------------------------------|
| Season | The season the episode is from. (Numeric) |
| Episode | The episode number. (Numeric) |
| Character | The character who spoke the line. (String) |
| Line | The line spoken by the character. (String) |

Files: Season-1.csv, Season-10.csv, Season-11.csv, Season-12.csv, Season-13.csv, Season-14.csv, Season-15.csv, ... Each per-season file contains the same four columns (Season, Episode, Character, Line) as All-seasons.csv.
By Huggingface Hub [source]
The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is perfect for natural language processing tasks such as sentiment analysis
This Yelp reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.
To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel) and take a look at the first few rows to get a feel for what you're working with. For example, the first row looks like this: text = "So there is no way for me to plug it in here in the US unless I go by...", label = 0. A minimal classification sketch follows.
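A minimal classification sketch using scikit-learn, assuming train.csv has the text and label columns described below:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    train["text"], train["label"], test_size=0.2, random_state=42
)

# Bag-of-words features plus a linear classifier is a strong, fast baseline.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_val))
print("Validation accuracy:", accuracy_score(y_val, preds))
```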
- This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
- This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
- This dataset could be used to build a natural language processing system that generates fake Yelp reviews
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |

File: test.csv

| Column name | Description |
|:------------|:------------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
If you use this dataset in your research, please credit Huggingface Hub.
By Kuzak Dempsy [source]
This dataset contains detailed information on the risk factors for cardiovascular disease. It includes information on age, gender, height, weight, blood pressure values, cholesterol levels, glucose levels, smoking habits and alcohol consumption of over 70 thousand individuals. Additionally it outlines if the person is active or not and if he or she has any cardiovascular diseases. This dataset provides a great resource for researchers to apply modern machine learning techniques to explore the potential relations between risk factors and cardiovascular disease that can ultimately lead to improved understanding of this serious health issue and design better preventive measures
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset can be used to explore the risk factors of cardiovascular disease in adults. The aim is to understand how certain demographic factors, health behaviors and biological markers affect the development of heart disease.
To start, look through the columns of data and familiarize yourself with each one. Understand what each field means and how it relates to heart health:
- Age: age of the participant (integer)
- Gender: gender of the participant (male/female)
- Height: height measured in centimeters (integer)
- Weight: weight measured in kilograms (integer)
- Ap_hi: systolic blood pressure reading taken from the patient (integer)
- Ap_lo: diastolic blood pressure reading taken from the patient (integer)
- Cholesterol: total cholesterol level on a 0 to 5+ unit scale (integer), with each unit corresponding to roughly 20 mg/dL
- Gluc: glucose level on a 0 to 16+ unit scale (integer), with each unit corresponding to roughly 1 mmol/L
- Smoke: whether the person smokes (binary; 0 = no, 1 = yes)
- Alco: whether the person drinks alcohol (binary; 0 = no, 1 = yes)
- Active: whether the person is physically active (binary; 0 = no, 1 = yes)
- Cardio: whether the person suffers from cardiovascular disease (binary; 0 = no, 1 = yes)

Identify any trends between the different attribute values and the development of cardiovascular disease among the individuals represented in this dataset. Age, gender, weight and lifestyle practices like smoking and drinking alcohol are all key influences when analyzing this problem. You can keep refining your analysis until you find patterns that let you draw conclusions from your exploration, and you can further enrich your understanding with modeling techniques such as regression and classification models, along with more recent deep learning approaches (a minimal modeling sketch follows below). Have fun!
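As a hedged illustration of the regression/classification idea above, here is a minimal scikit-learn sketch. It assumes the heart_data.csv file and the lower-case column names shown in the table further down, with cardio as the binary outcome; check the separator, any extra id columns, and the actual column names against your copy of the file before relying on it.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the file; column names are assumed to follow the table below.
df = pd.read_csv("heart_data.csv")

# "cardio" is assumed to be the binary outcome (0 = no disease, 1 = disease).
y = df["cardio"]
X = df.drop(columns=["cardio"])

# One-hot encode any non-numeric columns (gender is described as a string).
X = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale features and fit a simple logistic regression baseline.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Inspect which risk factors carry the most weight in this simple model.
coefs = pd.Series(clf.named_steps["logisticregression"].coef_[0], index=X.columns)
print(coefs.sort_values())
```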
- Analyzing the effect of lifestyle and environmental factors on the risk of cardiovascular disease.
- Predicting the risks of different age groups based on their demographic characteristics such as gender, height, weight and smoking status.
- Detecting patterns between levels of physical activity, blood pressure and cholesterol levels with likelihood of developing cardiovascular disease among individuals
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: heart_data.csv

| Column name | Description |
|:------------|:----------------------------------------------------|
| age         | Age of the individual. (Integer)                     |
| gender      | Gender of the individual. (String)                   |
| height      | Height of the individual in centimeters. (Integer)   |
| weight      | Weight of the individual in kilograms. (Integer)     |
| ap_hi       | Systolic blood pressure reading. (Integer)           |
| ap_lo       | Diastolic blood pressure reading. (Integer)          |
| cholesterol | Cholesterol level of the individual. (Integer)       |
| gluc        | ...
By [source]
This dataset offers a fascinating insight into gender differences in fear-related personality traits and their correlation with physical strength across five university samples. It includes demographic information such as age, gender, and ethnicity, as well as physical strength measures (grip strength and chest strength) taken from undergraduate students at the University of California Santa Barbara, Oklahoma State University, University of Texas Austin, and Arizona State University. Additionally, the dataset includes self-report measures of HEXACO Emotionality, allowing the relationship between physical strength and fear-related personality traits to be explored, which is key information to consider when designing interventions for mental health issues. With this data we could examine how temperament relates to physical measures such as grip or chest strength: does having a fearful personality predispose someone to lower levels of physical power, and how does this relationship differ depending on sex? Answering these questions could provide valuable insight into how bodily strength relates to psychological traits that differ by gender. Do not miss out on this opportunity to learn more about fear-related personality traits and their association with physical strength!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
- Using the physical strength and fear-related personality trait measures, universities can identify and target students who might need extra support in an intervention to improve mental health and wellbeing.
- Exploring correlations between the physical strength measures and overall HEXACO Emotionality scores, as well as its Anxiousness, Fearfulness, Sentimentalism, and Emotional Dependence facets, to understand how gender differences in fear-related personality traits may relate to physical strength (a minimal correlation sketch follows this list).
- Comparing the distributions of simple demographic measures such as age and ethnicity across five different university samples to explore commonalities or potential differences among student populations at different universities
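For the correlation idea in the second bullet, a minimal pandas sketch is shown below. It assumes the Sample_1.csv column names listed further down (grip, chest, and the e_anx_*/e_dep_* items) and simply averages the emotionality items into a rough composite, which is an illustrative choice rather than the authors' published scoring key.

```python
import pandas as pd

# Load one of the university samples (column names assumed from the table below).
df = pd.read_csv("Sample_1.csv")

# Average the HEXACO Emotionality items into a rough composite score.
# (Illustrative only; the published scoring key may group or reverse-score items differently.)
emotionality_items = [c for c in df.columns if c.startswith(("e_anx_", "e_dep_"))]
df["emotionality"] = df[emotionality_items].mean(axis=1)

# Correlate physical strength with the composite, overall and split by gender.
print(df[["grip", "chest", "emotionality"]].corr())
print(df.groupby("female")[["grip", "chest", "emotionality"]].corr())
```

The same code can be pointed at the other sample files to compare patterns across the university samples.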
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Sample_1.csv

| Column name | Description |
|:------------|:-----------------------------------------------------------------|
| age         | Age of the participant. (Numeric)                                 |
| female      | Gender of the participant (1 = Female, 0 = Male). (Categorical)   |
| ethnicity   | Ethnicity of the participant. (Categorical)                       |
| grip        | Grip strength of the participant. (Numeric)                       |
| chest       | Chest strength of the participant. (Numeric)                      |
| e_anx_1     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_2     | Anxiety score of the participant. (Numeric)                       |
| e_anx_3     | Sentimentalism score of the participant. (Numeric)                |
| e_anx_4     | Emotional Dependence score of the participant. (Numeric)          |
| e_anx_5     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_6     | Anxiety score of the participant. (Numeric)                       |
| e_anx_7     | Sentimentalism score of the participant. (Numeric)                |
| e_anx_8     | Emotional Dependence score of the participant. (Numeric)          |
| e_anx_9     | Fearfulness score of the participant. (Numeric)                   |
| e_anx_10    | Anxiety score of the participant. (Numeric)                       |
| e_dep_1     | Sentimentalism score of the participant. (Numeric)                |
| e_dep_2     | Emotional Dependence score of the participant. (Numeric)          |
| e_dep_3     | Fearfulness score of the participant. (Numeric)                   |
| e_dep_4     | Anxiety score of the participant. (Numeric)                       |
| e_dep_5     | Sentimentalism score of the participant. ...
By SocialGrep [source]
The "stonks" movement spawned by the GME saga is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy, yet here we are.
This dataset contains a collection of Reddit posts and comments mentioning GME in their title and body text, respectively. The data was procured using SocialGrep, and the posts and comments are labelled with their score.
It will be interesting to see how this affected stock market prices in the aftermath, and this new dataset makes that analysis possible.
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
The files contain Reddit posts and comments mentioning GME, together with their scores (and, for comments, a sentiment label). They can be used to analyze how sentiment around GME related to its stock price in the aftermath; a minimal aggregation sketch follows the use-case list below.
- To study how social media affects stock prices
- To study how Reddit affects stock prices
- To study how the sentiment of a subreddit affects stock prices
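As a rough starting point for these use cases, the sketch below computes daily average sentiment and score from the comments file. It assumes created_utc is a Unix timestamp in seconds and that the sentiment column can be coerced to a number; both assumptions should be checked against the actual files.

```python
import pandas as pd

# Load the comments file (column names taken from the table below).
comments = pd.read_csv("six-months-of-gme-on-reddit-comments.csv")

# created_utc is assumed here to be a Unix timestamp in seconds.
comments["created"] = pd.to_datetime(comments["created_utc"], unit="s")

# The sentiment column is listed as a string; coerce it to numeric where possible.
comments["sentiment"] = pd.to_numeric(comments["sentiment"], errors="coerce")
comments = comments.dropna(subset=["sentiment"]).copy()

# Aggregate to daily averages of sentiment and score, plus comment volume.
comments["day"] = comments["created"].dt.date
daily = comments.groupby("day").agg(
    mean_sentiment=("sentiment", "mean"),
    mean_score=("score", "mean"),
    n_comments=("sentiment", "size"),
)
print(daily.head())
```

The resulting daily series can then be joined with GME price data from an external source to study the relationships described in the use cases above.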
If you use this dataset in your research, please credit the original authors. Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: six-months-of-gme-on-reddit-comments.csv

| Column name    | Description |
|:---------------|:--------------------------------------------------------|
| type           | The type of post or comment. (String)                    |
| subreddit.name | The name of the subreddit. (String)                      |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)                 |
| created_utc    | The time the post or comment was created. (Timestamp)    |
| permalink      | The permalink of the post or comment. (String)           |
| body           | The body of the post or comment. (String)                |
| sentiment      | The sentiment of the post or comment. (String)           |
| score          | The score of the post or comment. (Integer)              |

File: six-months-of-gme-on-reddit-posts.csv

| Column name    | Description |
|:---------------|:--------------------------------------------------------|
| type           | The type of post or comment. (String)                    |
| subreddit.name | The name of the subreddit. (String)                      |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)                 |
| created_utc    | The time the post or comment was created. (Timestamp)    |
| permalink      | The permalink of the post or comment. (String)           |
| score          | The score of the post or comment. (Integer)              |
| domain         | The domain of the post or comment. (String)              |
| url            | The URL of the post or comment. (String)                 |
| selftext       | The selftext of the post or comment. (String)            |
| title          | The title of the post or comment. (String)               |
If you use this dataset in your research, please credit the original authors and SocialGrep.