By Reddit [source]
This dataset offers an insightful look into one of the most talked-about online communities today: Reddit. Specifically, it focuses on the funny subreddit, one of the most highly engaged subsections of the site. The dataset includes post titles, scores and other details about post creation, along with metrics of active community interaction such as comment counts and timestamps. By diving into this data, we can build a fuller picture of what people find funny in the digital age: which topics draw the most responses, how sentiment changes over time, and how community managers can use these insights to grow their platforms and better engage their user base. With this dataset at your fingertips, you'll be able to answer each of these questions and more.
Introduction
Welcome to the Reddit Funny Subreddit Kaggle Dataset. With it, you can explore and analyze posts from the popular subreddit to gain insight into community engagement, understand engagement trends, and learn how people interact with content on different topics. This guide provides further information about how to use the dataset for your data analysis projects.
Important Columns
This dataset contains the following columns: title, score, url, comms_num (number of comments), created (date of post), body (content of post) and timestamp. All of these columns are important for understanding user interactions with each post on Reddit’s Funny Subreddit.
Exploratory Data Analysis
To get a better understanding of user engagement on the subreddit, some initial exploration is needed. Graphical tools such as histograms or boxplots make it easy to see how basic quantities like scores or comment counts are distributed across posts, over time, or across other groupings (for example, type of joke).
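A minimal EDA sketch along these lines is shown below; the file name (funny.csv) and column names follow the description above, and the exact timestamp format is an assumption that may need adjusting.

```python
# Minimal EDA sketch for funny.csv, assuming the columns described above
# (score, comms_num, timestamp). Adjust names/paths as needed.
import pandas as pd
import matplotlib.pyplot as plt

posts = pd.read_csv("funny.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
posts["score"].plot.hist(bins=50, ax=axes[0], title="Score distribution")
posts["comms_num"].plot.hist(bins=50, ax=axes[1], title="Comment count distribution")

# Engagement over time, using the human-readable timestamp column
posts["timestamp"] = pd.to_datetime(posts["timestamp"], errors="coerce")
posts.set_index("timestamp")["score"].resample("W").median().plot(
    ax=axes[2], title="Median post score per week"
)
plt.tight_layout()
plt.show()
```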
Dimensionality reduction
For more advanced analytics, a dimensionality reduction technique such as PCA can be applied before the main analysis, so that similar posts are grouped together and conflicting or irrelevant variables are removed. This makes later, data-driven conclusions easier to draw and less likely to be clouded by noise.
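As a hedged illustration of this step, the sketch below builds TF-IDF features from post titles, reduces them with truncated SVD (a PCA-style decomposition suited to sparse text features), and groups posts with k-means. The choice of title TF-IDF features and the cluster count are assumptions, not prescriptions from the dataset authors.

```python
# Sketch only: reduce TF-IDF features of post titles, then group similar posts.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

posts = pd.read_csv("funny.csv").dropna(subset=["title"])

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X = tfidf.fit_transform(posts["title"])

# TruncatedSVD plays the role of PCA for sparse text features
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)

posts["cluster"] = KMeans(n_clusters=10, random_state=0, n_init=10).fit_predict(X_reduced)
print(posts.groupby("cluster")["score"].median())
```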
Further Guidance
If you need further assistance with this dataset, readings on topics such as text mining, natural language processing and machine learning are highly recommended. They explain in detail the steps that can help unlock greater value from Reddit's funny subreddit and give readers and researchers ideas about which approaches to take when analyzing text-based online platforms such as Reddit.
- Analyzing post title length vs. engagement (i.e., score, comments); see the sketch after this list.
- Comparing sentiment of post bodies between posts that have high/low scores and comments.
- Comparing topics within the posts that have high/low scores and comments to look for any differences in content or style of writing based on engagement level.
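A minimal sketch of the first idea, correlating title length with score and comment count (column names are taken from the description above):

```python
# Sketch: does title length relate to engagement?
import pandas as pd

posts = pd.read_csv("funny.csv")
posts["title_len"] = posts["title"].astype(str).str.len()

print(posts[["title_len", "score", "comms_num"]].corr(method="spearman"))
```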
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: funny.csv

| Column name | Description |
|:------------|:------------|
| title | Title of the post |
| score | Score of the post |
| url | URL of the post |
| comms_num | Number of comments |
| created | Date of the post |
| body | Content of the post |
| timestamp | Timestamp of the post |
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Hello everyone. I wanted to share my dataset here. It is entirely a product of my imagination, with no connection to real data, and I created it for my own data analysis training and for analysts working on data analysis. Have fun! :)
UserId: Unique identifier for each user in the data set
UsageDuration: Total time spent by the user on social media in hours
Age: Age of the user in years
Country: Country of residence of the user
TotalLikes: Total number of likes given by the user in a day
The Exhibit of Datasets was an experimental project with the aim of providing concise introductions to research datasets in the humanities and social sciences deposited in a trusted repository and thus made accessible for the long term. The Exhibit consists of so-called 'showcases', short webpages summarizing and supplementing the corresponding data papers, published in the Research Data Journal for the Humanities and Social Sciences. The showcase is a quick introduction to such a dataset, a bit longer than an abstract, with illustrations, interactive graphs and other multimedia (if available). As a rule it also offers the option to get acquainted with the data itself, through an interactive online spreadsheet, a data sample or link to the online database of a research project. Usually, access to these datasets requires several time-consuming actions, such as downloading data, installing the appropriate software and correctly uploading the data into these programs. This makes it difficult for interested parties to quickly assess the possibilities for reuse in other projects.
The Exhibit aimed to help visitors to the website get the right information at a glance by:
- Attracting attention to (recently) acquired deposits: showing why the data are interesting.
- Providing a concise overview of the dataset's scope and research background; more details are to be found, for example, in the associated data paper in the Research Data Journal (RDJ).
- Bringing together references to the location of the dataset and to more detailed information elsewhere, such as the project website of the data producers.
- Allowing visitors to explore (a sample of) the data without first downloading and installing associated software (see below).
- Publishing related multimedia content, such as videos, animated maps, slideshows etc., which are currently difficult to include in online journals such as the RDJ.
- Making it easier to review the dataset. The Exhibit would also have been the right place to publish these reviews, in the same way as a webshop publishes consumer reviews of a product, but this could not yet be achieved within the limited duration of the project.
Note (1) The text of the showcase is a summary of the corresponding data paper in RDJ, and as such a compilation made by the Exhibit editor. In some cases a section 'Quick start in Reusing Data' is added, whose text is written entirely by the editor. (2) Various hyperlinks such as those to pages within the Exhibit website will no longer work. The interactive Zoho spreadsheets are also no longer available because this facility has been discontinued.
Explore the world of data visualization with this Power BI dataset containing HR Analytics and Sales Analytics datasets. Gain insights, create impactful reports, and craft engaging dashboards using real-world data from HR and sales domains. Sharpen your Power BI skills and uncover valuable data-driven insights with this powerful dataset. Happy analyzing!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]
The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.
Data Set description:
The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.
The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.
The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of performance when only the interfaces of apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments.
Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.
References:
[1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
[2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
[3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
[4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2 - 8 are generated using the Flatterer (external link) utility.

### Description of resources:
1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON (see the reading sketch after this list).
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to Catalogue. It can be queried with database or analytical software tools for analysis.
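As a quick-start illustration for Resource 1, the sketch below reads the GZip-compressed JSON Lines catalogue line by line; the local file name is an assumption, and the metadata keys follow the standard CKAN package schema.

```python
# Sketch for reading the GZip-compressed JSON Lines catalogue (Resource 1).
# The local file name is an assumption; download the resource from the portal first.
import gzip
import json

records = []
with gzip.open("od-do-canada.jsonl.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        records.append(json.loads(line))  # one Dataset / Open Information Record per line

print(len(records), "records")
print(records[0].get("title"))  # keys follow the CKAN package schema
```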
The Backbencher Dataset is a unique and fun dataset designed to analyze and explore student behaviors and trends in classrooms. This dataset focuses on the attendance patterns, assignment completion rates, and other factors that influence a student’s academic performance, with a quirky twist: it includes a column identifying whether the student wears glasses!
This dataset is ideal for Machine Learning practitioners and Data Science enthusiasts who want to work on real-world datasets with an engaging context. It can be used for various ML problems such as:
- Predictive Analytics: Predicting student performance based on attendance and assignments.
- Clustering Analysis: Grouping students based on shared characteristics.
- Classification Tasks: Classifying students as "active" or "inactive" based on participation metrics (see the sketch after this list).

Key Features:
- USN: Unique Student Number for identification.
- Name: Student names (for reference).
- Attendance (%): Percentage of classes attended.
- Assignments Completed: Number of assignments completed.
- Exam Scores: Performance in exams.
- Participation in Activities: Measures involvement in extracurricular activities.
- Glasses (Yes/No): Whether the student wears glasses (interesting feature for pattern recognition).

Use Cases:
- Educational data analysis and predictive modeling.
- Creating engaging ML projects for students and beginners.
- Developing dashboards for visualizing student performance trends.
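As a hedged sketch of the classification task above, the snippet below derives an illustrative "active"/"inactive" label and fits a simple classifier; the file name, the exact column headers and the labelling rule are assumptions based on the feature list.

```python
# Illustrative sketch only: file name, column headers and labelling rule are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("backbencher.csv")  # hypothetical file name

# Example labelling rule: "active" = above-median participation
participation = df["Participation in Activities"]
df["active"] = (participation > participation.median()).astype(int)

# Encode the glasses feature and keep numeric predictors only
df["glasses"] = df["Glasses (Yes/No)"].map({"Yes": 1, "No": 0})
X = df[["Attendance (%)", "Assignments Completed", "Exam Scores", "glasses"]]
y = df["active"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```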
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention is often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and encountering challenges with how to organize, process and extract information into meaningful structures. Multidimensional spatio-temporal biological data sets such as time series gene expression with various perturbations over different cell lines, or neural spike trains across many experimental trials, offer the potential to gain insight into the dynamic behavior of the system. For this potential to be realized, we need a suitable representation to understand the data. A general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity measure. A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. Since the wide range of experiments and unknown complexity of the underlying system contribute to the heterogeneity of biological data, we develop a new method by proposing an extension of Robust Principal Component Analysis (RPCA), which models common variations across multiple experiments as the low-rank component and anomalies across these experiments as the sparse component. We show that the proposed method is able to find distinct subtypes and classify data sets in a robust way without any prior knowledge by separating these common responses and abnormal responses. Thus, the proposed method provides us with a new representation of these data sets which has the potential to help users acquire new insight from data.
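To make the low-rank-plus-sparse idea concrete, here is a minimal sketch of classical Robust PCA (principal component pursuit) on a synthetic matrix; it illustrates the decomposition the abstract refers to, not the authors' extended method.

```python
# Minimal sketch of classical Robust PCA (principal component pursuit),
# decomposing a matrix into a low-rank part L and a sparse part S.
# Illustration only; this is not the authors' extension of RPCA.
import numpy as np

def soft_threshold(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def robust_pca(X, n_iter=200, tol=1e-7):
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = (m * n) / (4.0 * np.abs(X).sum())
    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    for _ in range(n_iter):
        # Singular-value thresholding step for the low-rank component
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = U @ np.diag(soft_threshold(sig, 1.0 / mu)) @ Vt
        # Entry-wise shrinkage for the sparse (anomaly) component
        S = soft_threshold(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(X):
            break
    return L, S

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))  # common responses
sparse = np.zeros((100, 80))
sparse[rng.random((100, 80)) < 0.05] = 10.0                      # abnormal responses
L, S = robust_pca(low_rank + sparse)
print(np.linalg.matrix_rank(L, tol=1e-3), (np.abs(S) > 1).mean())
```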
This is a dataset of finances, also available in Power BI for practice. Use this dataset to practice Power BI.
This paper describes a method and system for integrating machine learning with planning and data visualization for the management of mobile sensors for Earth science investigations. Data mining identifies discrepancies between previous observations and predictions made by Earth science models. Locations of these discrepancies become interesting targets for future observations. Such targets become goals used by a flight planner to generate the observation activities. The cycle of observation, data analysis and planning is repeated continuously throughout a multi-week Earth science investigation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides comprehensive information on various factors contributing to hair fall. The dataset contains 717 responses from a survey designed to capture details about individual hair care practices, lifestyle choices, and genetic predispositions.
Features:
Potential Uses: Researchers, data scientists, and healthcare professionals can use this dataset to analyze the factors influencing hair fall. It is particularly useful for:
- Identifying patterns and correlations among various factors contributing to hair fall.
- Developing predictive models to forecast the likelihood of hair fall based on individual attributes.
- Designing personalized hair care and treatment plans.
- Conducting exploratory data analysis to uncover new insights about hair health.
Future Predictions: From this dataset, future predictions can be made regarding:
- The impact of lifestyle choices on hair fall severity.
- The likelihood of hair fall based on genetic predispositions and family history.
- The effectiveness of different hair care products and practices.
- The relationship between stress levels and hair fall.
This dataset serves as a valuable resource for advancing the understanding of hair fall causes and developing targeted solutions to mitigate this common issue.
Background
In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure.
Results
We used PCA to detect the major sources of variance underlying the hybridization conditions followed by gene selection based on PCA-derived and permutation-based test statistics. We validated our method by applying it to well characterized yeast cell-cycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes.
Conclusions
Combining data visualization and permutation-based gene selection, permutation-validated PCA enables one to illustrate gene-expression variance between several conditions and to select genes by taking into account the relationship of between-group to within-group variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as gene-specific scatter. Visualization of the data can support a straightforward biological interpretation.
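As a heavily simplified illustration of the general idea (a per-gene PCA-derived statistic compared against a permutation null), the sketch below uses synthetic data; it is not the authors' exact procedure.

```python
# Simplified sketch: compare each gene's projection onto the first principal
# component against a null distribution obtained by permuting expression values.
# Synthetic data and statistic choice are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(500, 12))                     # assumed: 500 genes x 12 hybridizations
expr[:25] += np.r_[np.zeros(6), np.ones(6) * 2.0]     # 25 genes differ between two groups of arrays

def pc1_scores(X):
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                                 # projection of each gene onto PC1

observed = np.abs(pc1_scores(expr))
null = np.stack([
    np.abs(pc1_scores(rng.permuted(expr, axis=1)))    # permute values within each gene
    for _ in range(200)
])
p_values = (null >= observed).mean(axis=0)
print("genes selected at p < 0.05:", int((p_values < 0.05).sum()))
```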
By Fraser Greenlee (From Huggingface) [source]
This dataset offers a valuable resource for various applications such as natural language processing, sentiment analysis, joke generation algorithms, or simply for entertainment purposes. Whether you're a data scientist looking to analyze humor patterns or an individual seeking some quick comedic relief, this dataset has got you covered.
By utilizing this dataset, researchers can explore different aspects of humor and study the linguistic features that make these short jokes amusing. Moreover, it provides an opportunity for developing computer models capable of generating similar humorous content based on learned patterns.
Understanding the Columns:
- text: This column contains the text of the short joke.

Exploring the Jokes:
- Start by exploring the text column, which contains the actual jokes. You can read through them and have a good laugh!

Analyzing the Jokes:
- To gain insights from this dataset, you can perform various analyses (a short sketch follows this list):
- Sentiment Analysis: Use Natural Language Processing techniques to analyze the sentiment of each joke.
- Categorization: Group jokes based on common themes or subjects, such as animals, professions, etc.
- Length Distribution: Analyze and visualize the distribution of joke lengths.
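A short sketch of two of these analyses (length distribution and a simple per-joke sentiment score), assuming the train.csv file and text column described below; the use of NLTK's VADER analyzer is an illustrative choice.

```python
# Sketch: joke length distribution and a simple sentiment score per joke.
# Requires NLTK's VADER lexicon to be downloaded.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

jokes = pd.read_csv("train.csv")

# Length distribution
jokes["length"] = jokes["text"].astype(str).str.len()
print(jokes["length"].describe())

# Sentiment of each joke (compound score in [-1, 1])
sia = SentimentIntensityAnalyzer()
jokes["sentiment"] = jokes["text"].astype(str).map(lambda t: sia.polarity_scores(t)["compound"])
print(jokes["sentiment"].describe())
```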
Creating New Content or Applications: Since this dataset provides a large collection of short jokes, you can utilize it creatively:
- Generating Random Jokes: Develop an algorithm that generates new jokes based on patterns found in this dataset.
- Humor Classification: Build a model that predicts if a given piece of text is funny or not using machine learning techniques.
Sharing Your Findings: If you make interesting discoveries or create unique applications using this dataset, consider sharing them with others in the Kaggle community.
Please note that no information regarding dates is available in train.csv; therefore, any temporal analysis or date-based insights won't be feasible with this specific file.
- Analyzing humor patterns: This dataset can be used to analyze different types of humor and identify patterns or common elements in jokes that make them funny. Researchers and linguists can use this dataset to gain insights into the structure, wordplay, or comedic techniques used in short jokes.
- Natural language processing: With the text data available in this dataset, it can be used for training models in natural language processing (NLP) tasks such as sentiment analysis, joke generation, or understanding humor from written text. NLP researchers and developers can utilize this dataset to build and improve algorithms for detecting or generating funny content.
- Social media analysis: Short jokes are popular on social media platforms like Twitter or Reddit where users frequently share humorous content. This dataset can be valuable for analyzing the reception and impact of these jokes on social media platforms. By examining trends, engagement metrics, or user reactions to specific jokes from the dataset, marketers or social media analysts can gain insights into what type of humor resonates with different online communities.

Overall, this dataset provides a rich resource for exploring various aspects related to humor analysis and NLP tasks while offering opportunities for sociocultural studies related to online comedy culture.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:--------------|:----------------------------------------------| | text | The actual content of the short jokes. (Text) |
If you use this dataset in your research, please credit Fraser Greenlee (from Huggingface).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The database for this study (Briganti et al. 2018; the same as for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years), 57% were female and 43% were male. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject is used.
The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
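As a small illustration of the reverse-scoring rule (items scored 0-4, so a reversed item becomes 4 minus the raw score), here is a hedged sketch; the file name and the use of the FeatureLabel values as column names are assumptions.

```python
# Sketch of the reverse-scoring step described above; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("empathy.csv")  # hypothetical file name

# Reversed items (3, 4, 7, 12, 13, 14, 15, 18, 19) carry an "_R" suffix in the
# feature labels below; on the 0-4 scale the reverse score is 4 - raw score.
reversed_cols = [c for c in df.columns if c.endswith("_R")]
df[reversed_cols] = 4 - df[reversed_cols]

# Component scores are then sums over each component's items, e.g. fantasy (FS)
fs_cols = [c for c in df.columns if "FS" in c]
df["fantasy"] = df[fs_cols].sum(axis=1)
print(df["fantasy"].describe())
```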
Size: 1973 × 28 (participants × items)
Number of features: 28
Ground truth: No
Type of Graph: Mixed graph
The following gives the description of the variables:
| Feature | FeatureLabel | Domain | Item meaning from Davis 1980 |
|:--------|:-------------|:-------|:-----------------------------|
| 001 | 1FS | Green | I daydream and fantasize, with some regularity, about things that might happen to me. |
| 002 | 2EC | Purple | I often have tender, concerned feelings for people less fortunate than me. |
| 003 | 3PT_R | Yellow | I sometimes find it difficult to see things from the “other guy’s” point of view. |
| 004 | 4EC_R | Purple | Sometimes I don’t feel very sorry for other people when they are having problems. |
| 005 | 5FS | Green | I really get involved with the feelings of the characters in a novel. |
| 006 | 6PD | Red | In emergency situations, I feel apprehensive and ill-at-ease. |
| 007 | 7FS_R | Green | I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it. (Reversed) |
| 008 | 8PT | Yellow | I try to look at everybody’s side of a disagreement before I make a decision. |
| 009 | 9EC | Purple | When I see someone being taken advantage of, I feel kind of protective towards them. |
| 010 | 10PD | Red | I sometimes feel helpless when I am in the middle of a very emotional situation. |
| 011 | 11PT | Yellow | I sometimes try to understand my friends better by imagining how things look from their perspective. |
| 012 | 12FS_R | Green | Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed) |
| 013 | 13PD_R | Red | When I see someone get hurt, I tend to remain calm. (Reversed) |
| 014 | 14EC_R | Purple | Other people’s misfortunes do not usually disturb me a great deal. (Reversed) |
| 015 | 15PT_R | Yellow | If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed) |
| 016 | 16FS | Green | After seeing a play or movie, I have felt as though I were one of the characters. |
| 017 | 17PD | Red | Being in a tense emotional situation scares me. |
| 018 | 18EC_R | Purple | When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed) |
| 019 | 19PD_R | Red | I am usually pretty effective in dealing with emergencies. (Reversed) |
| 020 | 20FS | Green | I am often quite touched by things that I see happen. |
| 021 | 21PT | Yellow | I believe that there are two sides to every question and try to look at them both. |
| 022 | 22EC | Purple | I would describe myself as a pretty soft-hearted person. |
| 023 | 23FS | Green | When I watch a good movie, I can very easily put myself in the place of a leading character. |
| 024 | 24PD | Red | I tend to lose control during emergencies. |
| 025 | 25PT | Yellow | When I’m upset at someone, I usually try to “put myself in his shoes” for a while. |
| 026 | 26FS | Green | When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me. |
| 027 | 27PD | Red | When I see someone who badly needs help in an emergency, I go to pieces. |
| 028 | 28PT | Yellow | Before criticizing somebody, I try to imagine how I would feel if I were in their place. |
More information about the dataset is contained in empathy_description.html file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GNSS Dataset (with Interference and Spoofing) consists of three parts: Part I (raw data, 12 to 20 Sept 2023, clean data), Part II (raw data, 21 to 30 Sept 2023, clean data) and Part III (processed data, 12 to 30 Sept 2023; data collected with spoofing and jamming on 21 Dec 2023; scripts and material).
The data were recorded by a GNSS receiver installed on the 5th floor of the Science Hall of Yunnan University. A HackRF One emitted spoofing signals and a commercial jammer emitted suppression jamming to attack the receiver on 21 Dec 2023. The provided datasets are of interest to the scientific communities working on GNSS monitoring, GNSS security, anti-jamming and anti-spoofing mechanisms.
These data provide the most comprehensive information available on the spatial and temporal patterns of GNSS satellites, observation and receiver parameters, covering five constellations (GPS, Compass, Galileo, GLONASS and QZSS) and eight signal bands (L1C/A, L2C, E1, E5b, B1, B2, L1, L2). The dataset provides observations of the receiver in three scenarios: normal state, affected by commercial jammers, and spoofed by an SDR HackRF One. These observations include details such as carrier-to-noise density ratio (C/N0), signal spectrum, Doppler shift, pseudorange, carrier phase, satellite health indicator, real-time position data and dilution of precision (DOP). The data can be used to analyze navigation satellite operation rules, satellite coverage time above the receiver, satellite overhead time prediction and GNSS monitoring system construction, and they offer a large amount of fine-grained data that can be used, for example, to study safeguards at civil aviation airports and monitoring for harmful radio interference.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
Anyone can obtain a data visualization of a text search within seconds via generic, large-scale search tools such as the Google n-gram viewer. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.
Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.
We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.
Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions, which have now been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.
Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Obtaining all types of data (numerical, temporal, image, categorical, CSV, DICOM) in a short and malleable format for quick and easy use was something that I, as a learner, wished I had. The huge and complex nature of publicly available datasets was sometimes too intimidating for beginners, and even for professionals who just want a quick sanity check of their algorithm on another dataset. This dataset aims to solve exactly that problem.
The **Diverse Algorithms Analysis Dataset (DAAD)** contains several different types of datasets, all grouped into one for easy access by a learner. It contains concise, well-documented data to help you jump-start your implementations of algorithms.
A user can use this dataset in several ways:
This dataset is intended to be dynamic. But the current version contains the following:
Pokemon_categorical: A CSV file that contains different information relating to every pokemon in a categorical format. Information such as abilities, attack, defense, points, etc is present. The objective is to predict whether a pokemon is legendary or not. So a typical binary classification problem.
Pokemon_numerical: A CSV file quite similar to Pokemon_categorical but with fewer categorical features and more emphasis on numeric scores such as points, HP and Generation, including attack, special attack and defense scores. The objective is once again binary classification of whether a pokemon is legendary or not.
Stock_forecasting: A CSV file that contains the stock price of a multinational company obtained over a continuous rolling 2-year period. Ideal for beginners to dive into stock prediction and for training simple to complex regression models. Best results are obtained using sequence models such as RNNs, LSTMs or GRUs.
Temperatures_3_years: A CSV file that contains the daily minimum temperatures of a city recorded over a rolling 3-year period. The objective can be modeled according to user needs: you may choose to predict the temperatures for the next month or make day-wise predictions. This dataset performs very well with LSTMs and shows considerable performance with boosting algorithms.
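As a hedged sketch of how such a series can be framed for the models mentioned above, the snippet below builds sliding-window samples from the temperature file; the file name and column layout are assumptions.

```python
# Sketch: frame the daily-minimum-temperature series as supervised learning
# with a sliding window. File name and column layout are assumptions.
import numpy as np
import pandas as pd

# Assumed layout: first column is the date, second column the daily minimum temperature
data = pd.read_csv("temperatures_3_years.csv", parse_dates=[0])
series = data.iloc[:, 1].to_numpy(dtype=float)

window = 30  # use the previous 30 days to predict the next day
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

# X has shape (n_samples, window); reshape to (n_samples, window, 1) for an LSTM,
# or feed it directly to a tree-based / boosting regressor.
print(X.shape, y.shape)
```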
License plate number detection: This dataset contains about 120 train and 50 test images (a compact version of a larger dataset) of car number plates. The user can try out several ROI-pooling, image localization and detection techniques, along with implementing some cool OCRs on the dataset. The small size of the dataset can help you train faster and generalize easily. Ideal for a beginner in Computer Vision.
University_Recruitment_Data: This contains information which encompasses the bio-data of a student and his/her credentials. The work experience, degree percentage, and other such relevant factors are present. The objective is basically to solve a simple binary classification problem of whether the student will be recruited or not.
(to be contd...)
As I initially mentioned, it would have been a valuable resource for me to have a dataset like this, where I could train and deploy my models with relative ease and worry less about scavenging through several data sources. I intend DAAD to be a repository that can serve the needs of all types of ML enthusiasts and developers. I would appreciate contributions from fellow Kagglers in enriching this dataset, making it reachable to all and ideal for simple and quick implementations without losing the reliability of huge datasets.
Have Fun !
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
> Observations: 8,036 unique datasets
> Variables: 14
> Current as of: 16/01/2018
For a bit of fun I thought I'd write a quick script to retrieve all of the Kaggle datasets and do a bit of analysis on them.
The dataset contains all the unique datasets hosted on Kaggle since its inception, and each entry links off to the dataset itself.
If the community is interested, I am tempted to scrape each one and retrieve each dataset's metadata to consolidate a huge Kaggle data dictionary.