Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include the course syllabus used to teach quantitative research design and analysis methods to graduate Linguistics students using a blended teaching and learning approach. The blended course took place over two weeks and built on a face-to-face course presented over two days in 2019. Students worked through the topics in preparation for a live interactive video session each Friday to go through the activities. Additional communication took place on Slack for two hours each week. A survey was conducted at the start and end of the course to ascertain participants' perceptions of its usefulness. The links to online elements and the evaluations have been removed from the uploaded course guide.

Participants who complete this workshop will be able to:
- outline the steps and decisions involved in quantitative data analysis of linguistic data
- explain common statistical terminology (sample, mean, standard deviation, correlation, nominal, ordinal and scale data)
- perform common statistical tests using jamovi (e.g. t-test, correlation, ANOVA, regression)
- interpret and report common statistical tests
- describe and choose from the various graphing options used to display data
- use jamovi to perform common statistical tests and graph results

Evaluation

Participants who complete the course will use these skills and knowledge to complete the following activities for evaluation:
- analyse the data for a project and/or assignment (in part or in whole)
- plan the results section of an Honours research project (where applicable)

Feedback and suggestions can be directed to M Schaefer: schaemn@unisa.ac.za
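The course teaches these tests through jamovi's point-and-click interface. As a rough scriptable parallel only (not part of the course materials), the same tests can be run in Python with scipy on synthetic data:

```python
# Sketch of the tests the course covers (t-test, correlation, ANOVA),
# run on invented data; jamovi itself requires no code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)  # e.g. scores, condition A
group_b = rng.normal(loc=5.5, scale=1.0, size=30)  # condition B
group_c = rng.normal(loc=6.0, scale=1.0, size=30)  # condition C

# Independent-samples t-test between two groups
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Pearson correlation between two scale variables
r, r_p = stats.pearsonr(group_a, group_b)

# One-way ANOVA across three groups
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)

print(f"t = {t_stat:.2f} (p = {t_p:.3f}), r = {r:.2f}, F = {f_stat:.2f}")
```

Reporting conventions (effect sizes, confidence intervals) follow the same logic whether the test is run in jamovi or in code.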
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study addresses the challenge of quantifying shared knowledge in group discussions through text analysis. Topic modeling was applied to systematically evaluate how information sharing influences knowledge structures and decision-making. In an online group discussion setting, two mock jury experiments involving 204 participants were conducted to reach a consensus on a verdict for a fictional murder case. The first experiment investigated whether the bias in pre-shared information influenced the topic ratios of each participant. Topic ratios, derived from a Latent Dirichlet Allocation model, were assigned to each participant's chat lines. The presence or absence of shared information, as well as the type of information shared, systematically influenced the topic ratios that appeared in group discussions. In Experiment 2, false memories were assessed before and after the discussion to evaluate whether the topics identified in Experiment 1 measured shared knowledge. Mediation analysis indicated that a higher topic ratio related to evidence was statistically associated with an increased likelihood of false memory for evidence. These results suggested that topics yielded by LDA reflected the knowledge structure shared during group discussions.
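The measurement step described above (assigning LDA-derived topic ratios to each participant's chat lines) can be sketched as follows. The toy corpus and the two-topic setting are invented for illustration and do not reproduce the study's preprocessing or model choices:

```python
# Fit an LDA model and read off per-participant topic ratios.
# Each "document" stands in for one participant's combined chat lines.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

participants = [
    "the knife evidence points to the defendant",
    "the alibi witness saw him at the station",
    "fingerprints on the knife match the defendant",
    "the witness timeline supports the alibi",
]

counts = CountVectorizer().fit_transform(participants)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Rows are participants, columns are topics; each row sums to 1,
# giving the "topic ratio" per participant described in the abstract.
topic_ratios = lda.transform(counts)
print(topic_ratios.round(2))
```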
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.

Data Generation

The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus is a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant they were also easily attainable by the general public, extending the documents' reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe's main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe's other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions.
The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. Because ProPublica gathered the data directly from criminal justice officials via Freedom of Information Act requests, the dataset is in the public domain and no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.

Data Analysis

The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and with other relevant writings by the same authors. Several more specific types of discursive strategies attracted further critical examination:
- Testing claims and rationalizations that appear to serve the speaker's self-interest
- Examining conclusions and determining whether sufficient evidence supported them
- Revealing contradictions and/or inconsistencies within the same text and intertextually
- Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
- Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
- Judging sincerity of voice and the objective consideration of alternative perspectives

Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature.
The paper could have been completed with the critical discourse analysis alone. However, because one of its salient findings was that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict made this opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.

Logic of Annotation

Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
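The z-test comparisons of proportions mentioned above were computed with online calculators. A minimal self-contained equivalent, using the standard pooled two-proportion formula with made-up counts (not the COMPAS data), looks like this:

```python
# Two-sided z-test for the difference of two independent proportions,
# the same computation an online two-proportion calculator performs.
import math
from scipy.stats import norm

def two_proportion_ztest(success1, n1, success2, n2):
    """Return (z, p-value) using the pooled-variance formula."""
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value

# Illustrative counts only: 30% vs. 25% positive classifications.
z, p = two_proportion_ztest(300, 1000, 250, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```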
https://spdx.org/licenses/CC0-1.0.html
Data Collection

Dance practitioners are a group in urgent need of attention. The literature points to problems in the professional status of dance practitioners: low and unstable financial income, limited professional development, multiple pressures and role conflicts, social class and social acceptance, and the wider impact on the discipline of dance. As the core group driving the professional field of dance, dance practitioners find that their contributions and returns do not match, which largely frustrates their career expectations and sense of achievement, leading to job burnout. This not only drives talent out of the industry but also affects the current and future development of the professional field of dance. Using an inductive approach, this study conducted semi-structured interviews with 10 qualified dance practitioners in Chinese universities to compare the cultural capital they accumulated early in their careers with the economic and social capital they actually obtained later, and to investigate the status quo of cultural capital transformation among dance practitioners. It also asked what challenges or obstacles dance practitioners face in transforming cultural capital, and why they have been able to hold on to their dance careers in the face of difficulties. These stories show that transforming dance practitioners' accumulated cultural capital into economic capital is difficult, while transformation into social capital is more pronounced. Dance practitioners' love for dance and their spiritual values are the main motivators that help them overcome challenges and obstacles. They now face the dual challenges of physical and mental health and job burnout.
This study considers implications for future research and practical applications, and hopes to draw attention to the health and safety of dance practitioners and to provide relevant supporting materials.

Methods

Data Collection

The teacher interviews took place between June and July 2024. A semi-structured interview was conducted with each participant; although the interviews were initially planned for a library discussion room at the University of Edinburgh, UK, geographic and time-zone differences meant that all interviews were conducted online through Teams, scheduled around the participants' availability. Each interview lasted 45 to 60 minutes. To make the interviews run more smoothly, each participant had access to the interview questions in advance. Each interview was video- and audio-recorded, and the key points emphasized to the participant by the interviewer were marked and recorded. The semi-structured interviews covered four themes: (1) cultural capital accumulation; (2) the transformation of cultural capital into economic and social capital; (3) challenges faced and coping strategies; and (4) reasons and prospects for persistence. The purpose of the interviews was to understand whether the cultural capital of dance practitioners (college dance teachers), such as dance professional knowledge, academic background, and accumulated professional certificates and honors, can be transformed into actual economic and social capital; what this transformation looks like in practice and what challenges and obstacles arise in the process; and why, in the face of difficulties and pressure, participants chose to stick with their dance careers. During the interviews, I asked the participants 23 questions, including open questions, closed questions and leading questions.
After a participant described specific events and feelings during the interview, the interviewer would summarize what the participant had said in a general way, for example: "This is...? Is that what you are trying to say?" or "So you think..." to ensure data accuracy. In addition, interviewers focused primarily on open-ended questions. When an interviewee was unable to continue answering in depth, the interviewer would guide to a limited extent, reducing deliberate guidance and intervention. For example: "What you just shared... can you tell me more?" or "Based on your personal experience, what do you think is the cause of...?" While it is helpful to have a basic interview guide, it is also important for interviewers to "actively listen and move the interview forward as much as possible by building on what the participants have already begun to share" (Seidman, 2013). All participants' data are kept in a university OneDrive account and can only be shared between the author and the supervisor. All data will be destroyed within 30 days of the completion of the paper.

Data Analysis

In my data analysis, I used a thematic analysis approach. Thematic analysis is a method of identifying and recording relevant patterns in qualitative data that, despite its multiple variants, typically follows a process from coding the data to reporting and discussing analytical themes. By extracting statements from large amounts of qualitative data, thematic analysis makes the analysis coherent and transparent to the reader, and can thus strongly support it. Thematic analysis consists of six steps: familiarization with the data, preliminary coding, searching for themes, reviewing themes, defining and naming themes, and finally writing a report (Braun & Clarke, 2006; Miles & Huberman). Each participant's interview was audio-recorded and transcribed.
First, I read each participant's data, paying special attention during second-level coding to data unique to each participant. To facilitate preliminary coding, interview material irrelevant to the research question was excluded, and single sentences were collated into complete paragraphs matched to each interview question. Merriam (2009) describes coding as the process of the researcher reading the data; noting interesting, potentially relevant or important parts; and holding conversations, questions and comments with the data. In this study, I adopted open coding, which implies maintaining an open mind during coding. In first-level coding, I coded the participants' statements directly, marking the original paragraphs or sentences that fully fit the research question; in second-level coding, I read the complete data and summarized words and phrases next to the text (Merriam, 2009). Merriam (2009) notes that "data analysis is a complex process involving repeated switching between concrete data and abstract concepts, between inductive and deductive reasoning, and between description and interpretation" (p. 176). I therefore moved back and forth between data fragments, descriptions, and interpretations, looking for common threads among the themes (Fraser, 2004). Open coding of the data generated 53 second-level codes, which posed challenges for the subsequent definition and naming of themes. By analyzing the common threads in this content, I progressively summarized the second-level codes into seven third-level coding themes that directly answer, define and name the research questions. The fourth-level coding corresponds to the four research questions of this study, each mapped to the third-level coding themes that answer them.
Taking place at the Leeds Institute for Data Analytics on April 27th as part of the Leeds Digital Festival, the Vision Zero Innovation Lab aims to explore ways to reduce the number of road casualties in Leeds to zero. If you would like to get involved or find out more, check out the event on Eventbrite.

Student Data Labs runs data-driven Innovation Labs for university students to learn practical data skills whilst working on civic problems. In the past, we have held Labs that tackle Type 2 Diabetes and health inequalities in Leeds. Student Data Labs works with an interdisciplinary team of students, data scientists, designers, researchers and software developers. We also aim to connect our Data Lab Volunteers with local employers who may be interested in employing them upon graduation. Visit our website, Twitter or Facebook for more info.

The Vision Zero Innovation Lab is split into two sections: a Learning Lab and an Innovation Lab. The Learning Lab helps students learn real-world data skills, getting them up and running with tools like R as well as common data science problems as part of a team. The Innovation Lab is more experimental, where the aim is to develop ideas and data-driven tools to take on wicked problems.
https://www.datainsightsmarket.com/privacy-policy
The Workforce Analytics market was valued at USD XX Million in 2023 and is projected to reach USD XXX Million by 2032, with an expected CAGR of 15.64% during the forecast period.

Workforce analytics is the collection, analysis, and interpretation of data about an organization's workforce in order to make better decisions and optimize human capital. Advanced analytics techniques can give organizations valuable insights into employee performance, engagement, productivity, and other key metrics. Workforce analytics helps an organization make fact-based decisions when acquiring, retaining, developing, and compensating talent. Patterns drawn from historical data analysis can then be applied to predict future workforce needs, helping solve potential problems before they arise. Workforce analytics further allows an organization to find potential talent, measure the ROI of training programs, and assess the effectiveness of organizational change initiatives. Using the power of workforce analytics, organizations can make their workforce more connected, productive, and effective in conducting business successfully.

Recent developments include:
- September 2022: ActivTrak partnered with Google Workspace to provide personal work insights that enable employees to improve their digital work habits and wellness. Customers can embed individual work metrics into their Google Workspace applications with ActivTrak for Google Workspace, giving employees immediate visibility to help them redesign their workday, protect focus time, and improve well-being.
- August 2022: ADP launched Intelligent Self-Service, which assists employees with common issues before they need to contact their HR department. Based on an analysis of data from across ADP's ecosystem, the product employs predictive analytics and machine learning to predict which issues may arise.
Key drivers for this market are: Increasing Need to Make Smarter Decisions About Talent, and Increasing Data in HR Departments related to Payrolls and Recruitment. Potential restraints include: Lack of Awareness About Workforce Analytics. Notable trends are: Performance Monitoring Offers Potential Growth.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of user reviews and ratings for dating applications, primarily sourced from the Google Play Store for the Indian region between 2017 and 2022. It offers valuable insights into user sentiment, evolving trends, and common feedback regarding dating apps. The data is particularly useful for practising Natural Language Processing (NLP) tasks such as sentiment analysis, topic modelling, and identifying user concerns.
The dataset is typically provided in a CSV file format. It contains a substantial number of records, estimated to be around 527,000 individual reviews. This makes it suitable for large-scale data analysis and machine learning projects. The dataset structure is tabular, with clearly defined columns for review content, metadata, and user feedback. Specific row/record counts are not exact but are indicated by the extensive range of index labels.
This dataset is ideally suited for a variety of analytical and machine learning applications: * Analysing trends in dating app usage and perception over the years. * Determining which dating applications receive more favourable responses and if this consistency has changed over time. * Identifying common issues reported by users who give low ratings (below 3/5). * Investigating the correlation between user enthusiasm and their app ratings. * Performing sentiment analysis on review texts to gauge overall user sentiment. * Developing Natural Language Processing (NLP) models for text classification, entity recognition, or summarisation. * Examining the perceived usefulness of top-rated reviews. * Understanding user behaviour and preferences across different dating apps.
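As a hedged starting point for the sentiment-analysis use case above, the sketch below scores review text with a tiny hand-made lexicon. The column names ("content", "score") are assumptions about the CSV, and the word lists are purely illustrative; a real analysis would use a trained model or an established lexicon rather than this toy list:

```python
# Baseline lexicon-based sentiment scoring over review texts.
import pandas as pd

POSITIVE = {"great", "love", "good", "helpful", "amazing"}
NEGATIVE = {"bad", "scam", "fake", "worst", "crash"}

def lexicon_score(text: str) -> int:
    """Count positive minus negative lexicon hits in a review."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Stand-in rows with the assumed columns from the dataset.
reviews = pd.DataFrame({
    "content": ["Great app, love the matches", "Fake profiles, worst scam"],
    "score": [5, 1],  # star rating out of 5
})
reviews["sentiment"] = reviews["content"].apply(lexicon_score)
print(reviews[["score", "sentiment"]])
```

A useful sanity check is whether the lexicon score correlates with the star rating, which is also the correlation question listed above.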
The dataset primarily covers user reviews from the Google Play Store, specifically for the Indian country region ('in'), despite being titled as "all regions" in some contexts. The data spans a time range from 2017 to 2022, offering a multi-year perspective on dating app trends and user feedback. There are no specific demographic details for the reviewers themselves beyond their reviews and ratings.
CC0
This dataset is suitable for: * Data Scientists and Analysts: For conducting deep dives into user sentiment, trend analysis, and predictive modelling. * NLP Practitioners and Researchers: As a practical dataset for training and evaluating natural language processing models, especially for text classification and sentiment analysis tasks. * App Developers and Product Managers: To understand user feedback, identify areas for improvement in their own or competing dating applications, and inform product development strategies. * Market Researchers: To gain insights into the consumer behaviour and preferences within the online dating market. * Students and Beginners: It is tagged as 'Beginner' friendly, making it a good resource for those new to data analysis or NLP projects.
Original Data Source: Dating Apps Reviews 2017-2022 (all regions)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Spatial information about the seafloor is critical for decision-making by marine resource science, management and tribal organizations. Coordinating data needs can help organizations leverage collective resources to meet shared goals. To help enable this coordination, the National Oceanic and Atmospheric Administration (NOAA) National Centers for Coastal Ocean Science (NCCOS) developed a spatial framework, process and online application to identify common data collection priorities for seafloor mapping, sampling and visual surveys off the US Caribbean territories of Puerto Rico and the US Virgin Islands. Fifteen participants from local federal, state, and academic institutions entered their priorities in an online application, using virtual coins to denote their priorities in 2.5x2.5 kilometer (nearshore) and 10x10 kilometer (offshore) grid size. Grid cells with more coins were higher priorities than cells with fewer coins. Participants also reported why these locations were important and what data types were needed. Results were analyzed and mapped using statistical techniques to identify significant relationships between priorities, reasons for those priorities and data needs. Fifteen high priority locations were broadly identified for future mapping, sampling and visual surveys. These locations include: (1) a coastal location in northwest Puerto Rico (Punta Jacinto to Punta Agujereada), (2) a location approximately 11 km off Punta Agujereada, (3) coastal Rincon, (4) San Juan, (5) Punta Arenas (west of Vieques Island), (6) southwest Vieques, (7) Grappler Seamount, (8) southern Virgin Passage, (9) north St. Thomas, (10) east St. Thomas, (11) south St. John, (12) west offshore St. Croix, (13) west nearshore St. Croix, (14) east nearshore St. Croix, and (15) east offshore St. Croix. 
Participants consistently selected (1) Biota/Important Natural Area, (2) Commercial Fishing and (3) Coastal/Marine Hazards as their top reasons (i.e., justifications) for prioritizing locations, and (1) Benthic Habitat Map and (2) Sub-bottom Profiles as their top data or product needs. This ESRI shapefile summarizes the results from this spatial prioritization effort. This information will enable US Caribbean organizations to more efficiently leverage resources and coordinate their mapping of high priority locations in the region.
This effort was funded by NOAA's NCCOS and supported by CRCP. The overall goal of the project was to systematically gather and quantify suggestions for seafloor mapping, sampling and visual surveys in the US Caribbean territories of Puerto Rico and the US Virgin Islands. The results will help organizations in the US Caribbean identify locations where their interests overlap with other organizations, coordinate their data needs, and leverage collective resources to meet shared goals.
There were four main steps in the US Caribbean spatial prioritization process. The first step was to identify the technical advisory team, which included four CRCP members: two from the Puerto Rico region and two from the USVI region. This advisory team recommended 33 organizations to participate in the prioritization. Each organization was then asked to designate a single representative, or respondent, who would have access to the web tool. The respondent would be responsible for communicating with their team about their needs and inputting their collective priorities.

Step two was to develop the spatial framework and an online application. To do this, the US Caribbean was divided into four subregions: nearshore and offshore for both Puerto Rico and USVI. The nearshore regions together comprised 2,387 square grid cells approximately 2.5x2.5 km in size; the offshore regions comprised 438 square grid cells 10x10 km in size. Existing relevant spatial datasets (e.g., bathymetry, protected area boundaries) were compiled to help participants understand information and data gaps and to identify areas they wanted to prioritize for future data collections. These spatial datasets were housed in the online application, which was developed using Esri's Web AppBuilder.

In step three, this online application was used by 15 participants to enter their priorities in each subregion of interest. Respondents allocated virtual coins in the grid cells to denote their priorities for each region. Respondents were given access to all four regions, regardless of which territory they represented, but were not required to provide input for each region. Grid cells with more coins were higher priorities than cells with fewer coins. Participants also reported why these locations were important and what data types were needed. Coin values were standardized across the nearshore and offshore zones and used to identify spatial patterns across the US Caribbean region as a whole.
The number of coins was standardized because each subregion had a different number of grid cells and participants. Standardized coin values were analyzed and mapped using statistical techniques, including hierarchical cluster analysis, to identify significant relationships between priorities, reasons for those priorities and data needs. This ESRI shapefile contains the 2.5x2.5 km and 10x10 km grid cells used in this prioritization effort and the associated standardized coin values overall, as well as by organization, justification and product. For a complete description of the process and analysis, please see Kraus et al. 2020.
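The two analysis steps described (standardizing coin values, then hierarchical clustering of grid cells) can be sketched conceptually as follows. The coin counts are invented for a single toy subregion, and the study's actual standardization and clustering choices may differ:

```python
# Standardize coin allocations, then cluster grid cells by priority level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Raw coins placed by participants in 6 grid cells of one subregion.
coins = np.array([0.0, 2.0, 10.0, 9.0, 1.0, 8.0])

# Standardize so subregions with different cell/participant counts are comparable.
standardized = (coins - coins.mean()) / coins.std()

# Agglomerative clustering (Ward linkage), cut into two clusters:
# high-priority cells vs. low-priority cells.
Z = linkage(standardized.reshape(-1, 1), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```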
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from an immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. It is a gold mine for improving products and services, as it contains comprehensive information on customers' experiences with a product, including ratings, titles, and plaintext content. The dataset contains both customer-specific data and product information, which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!
1. Analyze customer ratings to identify trends: Look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or dislike by examining common sentiment throughout the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product or service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword search algorithms can quickly reveal the general topics discussed in relation to your product or service across multiple reviews, allowing you to pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to spot when there has been an issue with something specific related to your product or service, such as a negative response to a feature that was introduced, proved unpopular with customers, and was removed shortly after introduction. This can save time and money by identifying issues before they become widespread concerns among larger sets of consumers.
4. Visualize sentiment data over time: Visualizations such as bar graphs can reveal trends across different categories more quickly than raw numbers alone; combining numeric values with color differences between scores makes anomalies easier to spot, allowing faster resolution when trying to figure out why certain spikes occurred while others stayed stable (or vice versa) in time-series visualizations.
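The score-tracking idea in point 3 can be sketched with pandas. The dates, ratings, and column names below are invented for illustration:

```python
# Aggregate ratings by month to surface dips that may flag an issue.
import pandas as pd

reviews = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-02-03",
        "2021-02-18", "2021-03-02", "2021-03-25",
    ]),
    "rating": [5, 4, 2, 1, 4, 5],  # the February dip flags a possible problem
})

# Monthly mean rating; "MS" buckets by month start.
monthly = reviews.set_index("date")["rating"].resample("MS").mean()
print(monthly)
```

A sustained drop in the monthly mean, rather than a single low review, is the signal worth investigating.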
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers' ratings on new products they have not yet tried, and leveraging this for further product development optimization initiatives.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) |
...
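A minimal loading sketch based on the column table above. The header names ("label", "title") come from the table, but the review-text column name is an assumption, since the table shown here is truncated:

```python
# Load the documented columns of train.csv; a small inline stand-in
# replaces the real file so the sketch is self-contained.
import io
import pandas as pd

csv_text = """label,title,content
positive,Works great,Arrived quickly and does the job
negative,Disappointed,Broke after a week
"""

df = pd.read_csv(io.StringIO(csv_text))  # for the real file: pd.read_csv("train.csv")
print(df["label"].value_counts())
```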
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Due to the large amounts of text generated by government agencies and policymakers, computer-assisted text-as-data methods are starting to become more popular for scholars of public administration, public policy, and political science, as they allow for much faster processing of large amounts of textual data. Here, I review several of the more common text-as-data methods and provide an overview of their applicability to different data structures and substantive questions in public administration. Then, using thousands of documents issued by the Centers for Medicare & Medicaid Services and its predecessor agency—the Health Care Financing Administration—I showcase the utility of topic models by illustrating how they can be used in conjunction with other politically-relevant covariates to help explain changes in agency priorities. I then conclude by discussing other possible uses for computational text analysis methods in public administration.
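The topic-model workflow described above can be sketched with scikit-learn's LDA implementation; the documents and topic count below are toy stand-ins, not the CMS corpus or the author's actual pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for agency documents; a real corpus would have thousands.
docs = [
    "medicare payment rates for hospital services",
    "hospital payment schedule and reimbursement rates",
    "fraud enforcement and compliance audits",
    "compliance audits target billing fraud",
]

# Bag-of-words counts feed the topic model.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

# Each row sums to 1: topic shares that can be regressed on covariates
# (e.g. year, administration) to trace shifts in agency priorities.
print(doc_topics.shape)
```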
https://crawlfeeds.com/privacy_policy
This dataset offers a focused and invaluable window into user perceptions and experiences with applications listed on the Apple App Store. It is a vital resource for app developers, product managers, market analysts, and anyone seeking to understand the direct voice of the customer in the dynamic mobile app ecosystem.
Dataset Specifications:
Last crawled: not specified.
Richness of Detail (11 Comprehensive Fields):
Each record in this dataset provides a detailed breakdown of a single App Store review, enabling multi-dimensional analysis:
Review Content:
- review: The full text of the user's written feedback, crucial for Natural Language Processing (NLP) to extract themes, sentiment, and common keywords.
- title: The title given to the review by the user, often summarizing their main point.
- isEdited: A boolean flag indicating whether the review has been edited by the user since its initial submission. This can be important for tracking evolving sentiment or understanding user behavior.

Reviewer & Rating Information:
- username: The public username of the reviewer, allowing for analysis of engagement patterns from specific users (though not personally identifiable).
- rating: The star rating (typically 1-5) given by the user, providing a quantifiable measure of satisfaction.

App & Origin Context:
- app_name: The name of the application being reviewed.
- app_id: A unique identifier for the application within the App Store, enabling direct linking to app details or other datasets.
- country: The country of the App Store storefront where the review was left, allowing for geographic segmentation of feedback.

Metadata & Timestamps:
- _id: A unique identifier for the specific review record in the dataset.
- crawled_at: The timestamp indicating when this particular review record was collected by the data provider (Crawl Feeds).
- date: The original date the review was posted by the user on the App Store.

Expanded Use Cases & Analytical Applications:
This dataset is a goldmine for understanding what users truly think and feel about mobile applications. Here's how it can be leveraged:
Product Development & Improvement:
- Mine the review text to identify recurring technical issues, crashes, or bugs, allowing developers to prioritize fixes based on user impact.
- Analyze the review text to inform future product roadmap decisions and develop features users actively desire.
- Surface requested improvements directly from the review field.
- Track rating and sentiment after new app updates to assess the effectiveness of bug fixes or new features.

Market Research & Competitive Intelligence:

Marketing & App Store Optimization (ASO):
- Analyze the review and title fields to gauge overall user satisfaction, pinpoint specific positive and negative aspects, and track sentiment shifts over time.
- Monitor rating trends and identify critical reviews quickly to facilitate timely responses and proactive customer engagement.

Academic & Data Science Research:
- The review and title fields are excellent for training and testing NLP models for sentiment analysis, topic modeling, named entity recognition, and text summarization.
- Study the rating distribution, isEdited status, and date to understand user engagement and feedback cycles.
- Compare country-specific reviews to understand regional differences in app perception, feature preferences, or cultural nuances in feedback.

This App Store Reviews dataset provides a direct, unfiltered conduit to understanding user needs and ultimately driving better app performance and greater user satisfaction. Its structured format and granular detail make it an indispensable asset for data-driven decision-making in the mobile app industry.
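As one illustration of NLP on the review field, here is a deliberately tiny lexicon-based sentiment scorer; the word lists and records are invented, and a production pipeline would use a trained model:

```python
# Minimal lexicon-based scorer for the `review` field. The word lists are
# toy assumptions, not a real sentiment lexicon.
POSITIVE = {"great", "love", "excellent", "smooth"}
NEGATIVE = {"crash", "bug", "slow", "terrible"}

def score_review(text: str) -> int:
    """Positive minus negative keyword hits; crude but transparent."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical records shaped like the dataset's review/rating fields.
records = [
    {"review": "Great app, love the new design", "rating": 5},
    {"review": "Constant crash after the update, terrible", "rating": 1},
]

for r in records:
    r["sentiment"] = score_review(r["review"])
print([r["sentiment"] for r in records])
```

A real pipeline would also strip punctuation and handle negation; this sketch only shows how review text maps to a per-record score that can be tracked alongside rating.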
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern technologies such as the Internet of Things (IoT) play a key role in Smart Manufacturing and Business Process Management (BPM). In particular, process mining benefits from enriched event logs that incorporate physical sensor data. This dataset presents an IoT-enriched XES event log recorded in a physical smart factory environment. It builds upon the previously published dataset “An IoT-Enriched Event Log for Process Mining in Smart Factories” (available on Zenodo) and follows the DataStream XES extension. In this modified version, three types of common Data Quality Issues (DQIs) - missing sensor values, missing sensors, and time shifts - have been artificially injected into the sensor data. These issues reflect realistic challenges in industrial IoT data processing and are valuable for developing and testing robust data cleaning and analysis methods.
By comparing the original (clean) dataset with this modified version, researchers can systematically evaluate DQI detection, handling, and resolution techniques under controlled conditions. Further details for each of the three DQI types are provided in a CSV changelog within the corresponding subfolders.
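A sketch of how two of the injected DQI types could be detected once sensor readings are tabular; the column names and the 1-second cadence are assumptions for illustration, not the dataset's actual XES schema:

```python
import pandas as pd

# Toy sensor stream; real logs follow the DataStream XES extension, but the
# same checks apply once readings are in (timestamp, value) form.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00", "2024-01-01 00:00:01",
        "2024-01-01 00:00:02", "2024-01-01 00:00:10",
    ]),
    "value": [1.0, None, 1.2, 1.3],
})

# DQI "missing sensor values": readings show up as NaNs.
n_missing = readings["value"].isna().sum()

# DQI "time shifts": gaps or jumps appear as outliers in the sampling interval
# (an assumed ~1 s cadence in this toy example).
gaps = readings["timestamp"].diff().dt.total_seconds()
suspicious = gaps[gaps > 2]

print(n_missing, len(suspicious))
```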
The MSP Data Study, undertaken on behalf of DG MARE between February and December 2016, presents an overview of what data and knowledge are needed by Member States for MSP decision making, taking into account different scales and different points in the MSP cycle. It examines current and future MSP data and knowledge issues from various perspectives (i.e. from Member States, Sea Basin(s) as well as projects and other relevant initiatives) in order to identify: - What data is available for MSP purposes and what data is actually used for MSP; - Commonalities in MSP projects and Member State experiences; - The potential for EMODnet sea basin portals to help coordination of MSP at a regional level and options for realising marine spatial data infrastructures to implement MSP; - Potential revisions to be made concerning INSPIRE specifications for MSP purposes. The study finds that across all European Sea Basins, countries are encountering similar issues with respect to MSP data needs. Differences are found in the scope of activities and sea uses between Member States and Sea Basins and the type of planning that is being carried out. Common data gaps include socio-economic data for different uses and socio-cultural information. By and large, data and information gaps are not so much about what data is missing but more about how to aggregate and interpret data in order to acquire the information needed by a planner. Challenges for Member States lie in developing second generation plans which require more analytical information and strategic evidence. Underlying this is the need for spatial evaluation tools for assessment, impact and conflict analysis purposes. Transnational MSP data needs are different to national MSP data needs. While the scope and level of detail of data needed is typically much simpler, ensuring its coherence and harmonisation across boundaries remains a challenge. 
Pan-European initiatives, such as the EMODnet data portals and Sea Basin Checkpoints, have the potential to support transboundary MSP data exchange needs by providing access to a range of harmonised data sets across European Sea Basins and testing the availability and adequacy of existing data sets to meet commercial and policy challenges.
WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights
WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:
- User ID and Firm Name: Identify and categorize calls by unique user IDs and company names.
- Call Duration: Analyze engagement levels through call lengths.
- Geographical Information: Detailed data on city, state, and country for regional analysis.
- Call Timing: Track peak interaction times with precise timestamps.
- Call Reason and Group: Categorised reasons for calls, helping to identify common customer issues.
- Device and OS Types: Information on the devices and operating systems used, for technical support analysis.
- Transcriptions: Full-text transcriptions of each call, enabling sentiment analysis, keyword extraction, and detailed interaction reviews.
Our dataset is designed for businesses aiming to enhance customer service strategies, develop targeted marketing campaigns, and improve product support systems. Gain actionable insights into customer needs and behavior patterns with this comprehensive collection, particularly useful for Consumer Data, Consumer Behavior Data, Consumer Sentiment Data, Consumer Review Data, AI Training Data, Textual Data, and Transcription Data applications.
WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.
Cases:
Enriching STT Models: The dataset includes a wide variety of real-world customer service calls with diverse accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.
Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model’s ability to transcribe in a more contextually relevant manner.
Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.
Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.
Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.
Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it’s providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.
Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as ...
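A minimal sketch of combining per-call sentiment with metadata, as in the sentiment-analysis case above; the field names and sentiment scores are assumptions (the scores would come from a model run over each transcription), not the vendor's actual schema:

```python
from collections import defaultdict

# Hypothetical call records shaped like the fields described above.
calls = [
    {"reason": "billing", "duration_s": 320, "sentiment": -0.6},
    {"reason": "billing", "duration_s": 210, "sentiment": -0.2},
    {"reason": "tech_support", "duration_s": 540, "sentiment": 0.1},
]

# Aggregate emotional tone per call reason to find the most frustrating flows.
totals = defaultdict(lambda: [0.0, 0])
for c in calls:
    totals[c["reason"]][0] += c["sentiment"]
    totals[c["reason"]][1] += 1

avg_sentiment = {reason: s / n for reason, (s, n) in totals.items()}
print(avg_sentiment)
```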
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.
| Scenario name | Number of work pieces used in the experiments | Repetitions (screw cycles) per workpiece | Individual screws per workpiece | Total number of observations | Number of unique classes | Purpose |
|---|---|---|---|---|---|---|
| S01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening |
| S02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations |
| S03_error-collection-1 | 1 | 2 | >20 | | | |
| S04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | |
The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:
1. S01_thread-degradation
2. S02_surface-friction
3. S03_screw-error-collection-1 (recorded but unpublished)
4. S04_screw-error-collection-2 (recorded but unpublished)
5. S05_upper-workpiece-manipulations (recorded but unpublished)
6. S06_lower-workpiece-manipulations (recorded but unpublished)
Additional scenarios may be added to this collection as they become available.
Each dataset follows a standardized structure:
These datasets are suitable for various research purposes:
These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.
We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries, and common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.
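For example, a single observation could be parsed with the standard json module; the record shape below is an assumption for illustration, so consult each scenario's README (or PyScrew) for the actual schema:

```python
import json

# Assumed shape of one screw-driving observation (not the published schema):
# a workpiece id, a cycle number, a torque curve, and a class label.
raw = '{"workpiece_id": 17, "cycle": 3, "torque": [0.1, 0.4, 1.2], "label": "ok"}'

obs = json.loads(raw)
peak_torque = max(obs["torque"])  # a simple feature for anomaly detection
print(obs["workpiece_id"], peak_torque)
```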
Each dataset includes:
For questions, issues, or collaboration interests regarding these datasets, please:
These datasets were collected and prepared from:
The research was supported by:
The Metabolomics workshop on experimental and data analysis training for untargeted metabolomics was hosted by the Proteomics Society of India in December 2019. The workshop included six tutorial lectures and hands-on data analysis training sessions presented by seven speakers. The tutorials and hands-on sessions focused on workflows for liquid chromatography-mass spectrometry (LC-MS) based untargeted metabolomics. We review here three main topics from the workshop that were identified as bottlenecks for new researchers: a) experimental design, b) quality controls during sample preparation and instrumental analysis, and c) data quality evaluation. Our objective is to present common challenges faced by novice researchers and possible guidelines and resources to address them. We provide resources and good practices for researchers who are at the initial stage of setting up metabolomics workflows in their labs. Complete detailed metabolomics/lipidomics protocols, including video tutorials, are available online in the EMBL-MCF protocol collection.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides customer reviews for Apple iPhones, sourced from Amazon. It is designed to facilitate in-depth analysis of user feedback, enabling insights into product sentiment, feature performance, and underlying discussion themes. The dataset is ideal for understanding customer satisfaction and market trends related to iPhone products.
The dataset is typically provided in a CSV file format. While specific record counts are not available, data points related to verified purchasers indicate over 3,000 entries. The dataset's quality is rated as 5 out of 5.
This dataset is well-suited for various analytical projects, including: * Sentiment analysis: To determine overall sentiment and identify trends in customer opinions. * Feature analysis: To analyse user satisfaction with specific iPhone features. * Topic modelling: To discover underlying themes and common discussion points within customer reviews. * Exploratory Data Analysis (EDA): For initial investigations and pattern discovery. * Natural Language Processing (NLP) tasks: For text analysis and understanding.
The dataset has a global regional coverage. While a specific time range for the reviews is not detailed, the dataset itself was listed on 08/06/2025.
CC0
Original Data Source: Apple IPhone Customer Reviews
THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that developed a centralized, web-based biospecimen locator presenting biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search, and request biospecimens for use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements in the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether clinical data are available to accompany the biospecimens; however, a requester can solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available for the public to browse. To request biospecimens from the ABL, the researcher must submit the required information. Upon submission, shipment of the requested biospecimen(s) depends on scientific and institutional review approval. Account required. Registration is open to everyone. Documented on August 26, 2019.

Database of published microarray gene expression data, and a software tool for comparing that published data to a user's own microarray results.
It is very simple to use: all you need is a web browser and a list of the probes that went up or down in your experiment. If you find L2L useful, please consider contributing your published data to the L2L Microarray Database in the form of list files. L2L finds true biological patterns in gene expression data by systematically comparing your own list of genes to lists of genes that have been experimentally determined to be co-expressed in response to a particular stimulus: that is, published lists of microarray results. The patterns it finds can point to the underlying disease process or affected molecular function that actually generated the observed changes in gene expression. Its insights are far more systematic than critical-gene analyses, and more biologically relevant than pure Gene Ontology-based analyses. The publications included in the L2L MDB initially reflected topics thought to be related to Cockayne syndrome: aging, cancer, and DNA damage. Since then, the scope of the publications included has expanded considerably to include chromatin structure, immune and inflammatory mediators, the hypoxic response, adipogenesis, growth factors, hormones, cell cycle regulators, and others. Despite the parochial origins of the database, the wide range of topics covered will make L2L of general interest to any investigator using microarrays to study human biology. In addition to the L2L Microarray Database, L2L contains three sets of lists derived from Gene Ontology categories: Biological Process, Cellular Component, and Molecular Function. As with the L2L MDB, each GO sub-category is represented by a text file that contains annotation information and a list of the HUGO symbols of the genes assigned to that sub-category or any of its descendants. You don't need to download L2L to use it to analyze your microarray data.
There is an easy-to-use web-based analysis tool, and you have the option of downloading your results so you can view them at any time on your own computer, using any web browser. However, if you prefer, the entire L2L project, and all of its components, can be downloaded from the download page. Platform: Online tool, Windows compatible, Mac OS X compatible, Linux compatible, Unix compatible
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table ("Table 1") of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.
Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.
Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database. Tests such as Tukey's rule for outlier detection and Hartigan's Dip Test for modality are computed to highlight potential issues in summarizing the data.
Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.
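tableone produces this kind of grouped "Table 1" in a single call; the pandas sketch below, with an invented toy cohort, merely mirrors the idea of per-group summary statistics that the package automates:

```python
import pandas as pd

# Invented toy cohort standing in for real study data.
df = pd.DataFrame({
    "age": [34, 51, 42, 60, 29, 47],
    "group": ["treated", "control", "treated", "control", "treated", "control"],
})

# Per-group summary statistics, the core of a "Table 1".
summary = df.groupby("group")["age"].agg(["count", "mean", "std"]).round(1)
print(summary)
```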
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the research presented in the following paper: Takayuki Hiraoka, Takashi Kirimura, Naoya Fujiwara (2024) "Geospatial analysis of toponyms in geo-tagged social media posts".
We collected georeferenced Twitter posts tagged to coordinates inside the bounding box of Japan between 2012-2018. The present dataset represents the spatial distributions of all geotagged posts as well as posts containing in the text each of 24 domestic toponyms, 12 common nouns, and 6 foreign toponyms. The code used to analyze the data is available on GitHub.
- selected_geotagged_tweet_data/: Number of geotagged Twitter posts in each grid cell. Each CSV file under this directory associates each grid cell (spanning 30 seconds of latitude and 45 seconds of longitude, approximately a 1 km x 1 km square, specified by an 8-digit code m3code) with the number of geotagged tweets tagged to the coordinates inside that cell (tweetcount).
- file_names.json: relates each of the toponyms studied in this work to the corresponding data file (all denotes the full data).
- population/population_center_2020.xlsx: Center of population of each municipality based on the 2020 census. Derived from data published by the Statistics Bureau of Japan on their website (Japanese).
- population/census2015mesh3_totalpop_setai.csv: Resident population in each grid cell based on the 2015 census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
- population/economiccensus2016mesh3_jigyosyo_jugyosya.csv: Employed population in each grid cell based on the 2016 Economic Census. Derived from data published by the Statistics Bureau of Japan on e-stat (Japanese).
- japan_MetropolitanEmploymentArea2015map/: Shape file for the boundaries of Metropolitan Employment Areas (MEA) in Japan. See this website for details of MEA.
- ward_shapefiles/: Shape files for the boundaries of wards in large cities, published by the Statistics Bureau of Japan on e-stat.

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include the course syllabus used to teach quantitative research design and analysis methods to graduate Linguistics students using a blended teaching and learning approach. The blended course took place over two weeks and builds on a face-to-face course presented over two days in 2019. Students worked through the topics in preparation for a live interactive video session each Friday to go through the activities. Additional communication took place on Slack for two hours each week. A survey was conducted at the start and end of the course to ascertain participants' perceptions of the usefulness of the course. The links to online elements and the evaluations have been removed from the uploaded course guide.

Participants who complete this workshop will be able to:
- outline the steps and decisions involved in quantitative data analysis of linguistic data
- explain common statistical terminology (sample, mean, standard deviation, correlation, nominal, ordinal and scale data)
- perform common statistical tests using jamovi (e.g. t-test, correlation, ANOVA, regression)
- interpret and report common statistical tests
- describe and choose from the various graphing options used to display data
- use jamovi to perform common statistical tests and graph results

Evaluation

Participants who complete the course will use these skills and knowledge to complete the following activities for evaluation:
- analyse the data for a project and/or assignment (in part or in whole)
- plan the results section of an Honours research project (where applicable)

Feedback and suggestions can be directed to M Schaefer schaemn@unisa.ac.za