https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Instagram Multi-Class Fake Account Dataset (IMFAD) contains four account types (physical, spam, fraud, and bot accounts), making it suitable for multi-class classification tasks.
2) Data Utilization
(1) The Instagram Multi-Class Fake Account Dataset (IMFAD) has the following characteristics:
• It was created as part of a fake Instagram account detection research project.
• All data has been anonymized and refined so that it can be published safely.
(2) The Instagram Multi-Class Fake Account Dataset (IMFAD) can be used for:
• Machine learning model study: it helps students, researchers, and developers build models for fake account detection (a minimal sketch follows).
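As an illustration of the multi-class task, here is a minimal Python sketch. It assumes a CSV export of IMFAD with numeric profile features and an account_type label column; the file name and column names are hypothetical and should be adjusted to the actual schema.

```python
# Minimal multi-class sketch; "imfad.csv" and "account_type" are hypothetical names.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("imfad.csv")
X = df.drop(columns=["account_type"])   # numeric profile features
y = df["account_type"]                  # physical / spam / fraud / bot

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-class precision/recall/F1
```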
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Comparative Political Economy Database (CPEDB) began at the Centre for Learning, Social Economy and Work (CLSEW) at the Ontario Institute for Studies in Education at the University of Toronto (OISE/UT) as part of the Changing Workplaces in a Knowledge Economy (CWKE) project. The database was initially conceived and developed by Dr. Wally Seccombe (independent scholar) and Dr. D.W. Livingstone (Professor Emeritus at the University of Toronto). Seccombe has conducted internationally recognized historical research on evolving family structures of the labouring classes (A Millennium of Family Change: Feudalism to Capitalism in Northwestern Europe and Weathering the Storm: Working Class Families from the Industrial Revolution to the Fertility Decline). Livingstone has conducted decades of empirical research on class and labour relations. A major part of this research has used the Canadian Class Structure survey conducted at the Institute of Political Economy (IPE) at Carleton University in 1982 as a template for Canadian national surveys in 1998, 2004, 2010 and 2016, culminating in Tipping Point for Advanced Capitalism: Class, Class Consciousness and Activism in the ‘Knowledge Economy’ (https://fernwoodpublishing.ca/book/tipping-point-for-advanced-capitalism) and a publicly accessible database including all five of these Canadian surveys (https://borealisdata.ca/dataverse/CanadaWorkLearningSurveys1998-2016). Seccombe and Livingstone have collaborated on a number of research studies that recognize the need to take account of expanded modes of production and reproduction. Both are Research Associates of CLSEW at OISE/UT.

The CPEDB Main File (an SPSS data file) covers the following areas (in order): demography, family/household, class/labour, government, electoral democracy, inequality (economic, political and gender), health, environment, internet, and macro-economic and financial variables. In its present form, it contains annual data on 725 variables from 12 countries (alphabetically): Canada, Denmark, France, Germany, Greece, Italy, Japan, Norway, Spain, Sweden, the United Kingdom and the United States. A few of the variables date back to 1928, and the majority date from 1960 to 1990; where those years are not covered in the source, a minority of variables begin in more recent years. All the variables end at the most recent available year (1999 to 2022). In the next version, to be developed in 2025, the most recent years (2023 and 2024) will be added wherever they are present in the sources’ datasets.

Researchers who are not using SPSS should refer to the Chart files for overviews, summaries and information on the dataset. For a current list of the variable names and their labels in the CPEDB, see the Excel file: Outline of SPSS file Main CPEDB, Nov 6, 2023. At the end of each variable label in this file and the SPSS data file, the source of that variable appears in brackets. Where two variables from a given source have been combined, the bracket begins with WS and then lists the variables combined. For the 14 variables Livingstone created at the beginning of the Class/Labour section, the brackets contain DWL together with his description of how each was derived. The CPEDB’s variables have been derived from many databases; the main ones are the OECD (its Statistics and Family Databases), World Bank, ILO, IMF, WHO, WIID (World Income Inequality Database), OWID (Our World in Data), ParlGov (Parliaments and Governments Database), and V-Dem (Varieties of Democracy).
The Institute of Political Economy (IPE) at Carleton University is currently the main site for continuing refinement of the CPEDB. IPE Director Justin Paulson and other members are involved, along with Seccombe and Livingstone, in further development and safe storage of this updated database both at the IPE at Carleton and the University of Toronto dataverse. All those who explore the CPEDB are invited to share their perceptions of the entire database, or any of its sections, with Seccombe generally (wseccombe@sympatico.ca) and with Livingstone for class/labour issues (davidlivingstone@utoronto.ca). They welcome suggestions for additional variables together with their data sources. A new version of the CPEDB will be created in the spring of 2025 and installed as soon as the revision is completed. This revised version is intended to be a valuable resource for researchers in all of the included countries.
Despite renewed interest in social class, very little is known about the meaning of class membership in twenty-first century Britain. This project aims to fill a growing gap in sociological research and political understanding by documenting the ways in which the deepest layers of everyday life are differentiated by social class. This includes: the use of space and time; daily routines and rhythms of life; geographical mobility; and roles and activities in work and in the domestic sphere. The latter will cover the household division of labour, relations with children and schoolwork, leisure activities and mealtimes. To capture all this, the project will involve intensive study of some twenty family households in Bristol. The interest is in 'ordinary' representatives of the class structure rather than the most marginalised, so participants will be households in which at least one adult has full-time work and at least one child is living at home. Households will be contacted through a randomised mailout to selected areas in Bristol, from which suitable participants will be selected. The project will deploy an innovative mix of research methods, including qualitative time-diaries, observation, photographic methods and interviews, to document the most taken-for-granted elements of routine everyday life.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study aimed to identify latent classes based on the characteristics of the neighborhood environment perceived by adolescents and their association with sex, socioeconomic status, body composition and movement behaviors.
The file presents information on the characteristics of the neighborhood environment derived from the Neighborhood Environment Walkability Scale for Youth (NEWS-Y), which were used in the LCA model to create latent classes describing the adolescents' neighborhood environment. It also presents information on movement behaviors (physical activity, sedentary behavior, and sleep) derived from accelerometry data, together with screen time and total sitting time obtained by questionnaires, sociodemographic information (age, sex, socioeconomic status), and body composition.
From the neighborhood data, an LCA model with three classes was identified: class 1, "Best Perceived Environment"; class 2, "Moderate Perceived Environment"; and class 3, "Worst Perceived Environment". The associations of these latent classes with the variables measured and described above were then tested.
The results demonstrate that only light physical activity, total sitting time, and socioeconomic status were associated with latent class membership. The findings highlight the influence of the neighborhood classes on adolescents' light physical activity and total sitting time.
No description was included in this Dataset collected from the OSF
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes' theorem, Logistic Regression, Support Vector Machines, and Multinomial Naive Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The MNB model achieved 97% accuracy, compared with 100% accuracy for both the GNB classifier and the RF classifier.
Methods
Prior to data collection, the researcher was guided by all ethical training certifications on data collection and the rights to confidentiality and privacy, under Institutional Review Board (IRB) approval. Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory-confirmed diagnosis. The data were divided into two tables: the first table, data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors recorded for patients inhabiting those areas. Multinomial Naive Bayes is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog records to electronic records.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. One training portion shall be taken from a table stored in the database and called training set 1, while the other training portion, taken from another table in the database, shall be called training set 2.
The dataset was split into two parts: 70% for training and 30% for testing. Using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were then tested on the remaining 30% of the data, and the results were compared with other machine learning models using standard metrics.
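A minimal Python sketch of this 70/30 split and MNB training follows; it assumes the preprocessed records are in a CSV with non-negative, count-style attribute columns and a binary malaria label (the file and column names are hypothetical).

```python
# Minimal sketch of the 70/30 split and MNB training described above.
# "malaria.csv" and the "malaria" label column are hypothetical names.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("malaria.csv")
X = df.drop(columns=["malaria"])        # the 15 symptom/context attributes
y = df["malaria"]                       # positive / negative label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70% train, 30% test

mnb = MultinomialNB().fit(X_train, y_train)  # requires non-negative features
print("accuracy:", accuracy_score(y_test, mnb.predict(X_test)))
```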
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Multinomial Naive Bayes classification in two phases: classification phase 1 and classification phase 2 (a minimal code sketch follows the phase descriptions below). The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part classified with classifier 2, as follows:
Classifier phase 1: classifies patients into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: classifies only the records classified as positive by classifier 1, further assigning them a complicated or uncomplicated class label. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed so that values are supplied for the core parameters that act as determining factors.
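A minimal sketch of this two-phase scheme in Python, assuming the two training tables have already been prepared as feature matrices with string labels ("positive"/"negative" for phase 1, "complicated"/"uncomplicated" for phase 2); all variable names here are hypothetical.

```python
# Two-phase classification sketch: phase 1 screens positive vs. negative,
# phase 2 grades only the positives as complicated vs. uncomplicated.
from sklearn.naive_bayes import MultinomialNB

def fit_two_phase(X1, y1, X2, y2):
    clf1 = MultinomialNB().fit(X1, y1)    # training set 1: positive / negative
    clf2 = MultinomialNB().fit(X2, y2)    # training set 2: severity of positives
    return clf1, clf2

def predict_two_phase(clf1, clf2, X):
    labels = clf1.predict(X).astype(object)   # phase 1 decision per record
    pos = labels == "positive"
    if pos.any():                             # only positives reach phase 2
        severity = clf2.predict(X[pos])
        labels[pos] = ["positive/" + s for s in severity]
    return labels
```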
This survey collected data to generate a comprehensive review of the economic and policy status of the recreational for-hire sector in the U.S. Gulf of Mexico, including charter, head, and guide boats. The survey created a socioeconomic dataset that can be used to analyze future economic, environmental, and policy questions, including those related to natural disturbances and the ongoing regulation of resource utilization in the Gulf. The specific project objectives included a) collecting economic, social, and policy data for all segments of the for-hire sector, b) identifying groups of respondents with relatively homogeneous characteristics, thereby defining operational classes that may be the focus of targeted, management-based economic and policy analysis, and c) constructing costs, earnings, and attitudinal profiles by operational class and state/region. The survey was conducted by mail, internet, and in-person interviews in 2010.
The Great Britain Historical Database has been assembled as part of the ongoing Great Britain Historical GIS Project. The project aims to trace the emergence of the north-south divide in Britain and to provide a synoptic view of the human geography of Britain at sub-county scales. Further information about the project is available on A Vision of Britain webpages, where users can browse the database's documentation system online.
These data were originally collected by the Censuses of Population for England and Wales, and for Scotland. They were computerised by the Great Britain Historical GIS Project and its collaborators. They form part of the Great Britain Historical Database, which contains a wide range of geographically-located statistics, selected to trace the emergence of the north-south divide in Britain and to provide a synoptic view of the human geography of Britain, generally at sub-county scales.
The first census report to tabulate social class was that of 1951, but this collection also includes a table from the Registrar-General's 1931 Decennial Supplement, which drew on census occupational data to tabulate social class by region. In 1961 and 1971 the census used a more detailed classification of Socio-Economic Groups, of which the five Social Classes are a simplification.
This is a new edition. Data from the Census of Scotland have been added for 1951, 1961 and 1971. Wherever possible, ID numbers have been added for counties and districts which match those used in the digital boundary data created by the GBH GIS, greatly simplifying mapping.
There is a requirement that public authorities, like Ofsted, must publish updated versions of datasets that are disclosed as a result of Freedom of Information requests.
Some information which is requested is exempt from disclosure to the public under the Freedom of Information Act; it is therefore not appropriate for this information to be made available. Examples of information which it is not appropriate to make available include the locations of women’s refuges, some military bases and all children’s homes and the personal data of providers and staff. Ofsted also considers that the names and addresses of registered childminders are their personal data, and it is not appropriate to make these publicly available unless those individuals have given their explicit consent to do so. This information has therefore not been included.
This dataset contains information on independent fostering agencies and voluntary adoption agencies in England.
MS Excel Spreadsheet, 200 KB
Date of next update: April 2017
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/B9TEWM
This dataset contains replication files for "The Fading American Dream: Trends in Absolute Income Mobility Since 1940" by Raj Chetty, David Grusky, Maximilian Hell, Nathaniel Hendren, Robert Manduca, and Jimmy Narang. For more information, see https://opportunityinsights.org/paper/the-fading-american-dream/. A summary of the related publication follows.

One of the defining features of the “American Dream” is the ideal that children have a higher standard of living than their parents. We assess whether the U.S. is living up to this ideal by estimating rates of “absolute income mobility” – the fraction of children who earn more than their parents – since 1940. We measure absolute mobility by comparing children’s household incomes at age 30 (adjusted for inflation using the Consumer Price Index) with their parents’ household incomes at age 30. We find that rates of absolute mobility have fallen from approximately 90% for children born in 1940 to 50% for children born in the 1980s. Absolute income mobility has fallen across the entire income distribution, with the largest declines for families in the middle class. These findings are unaffected by using alternative price indices to adjust for inflation, accounting for taxes and transfers, measuring income at later ages, and adjusting for changes in household size. Absolute mobility fell in all 50 states, although the rate of decline varied, with the largest declines concentrated in states in the industrial Midwest, such as Michigan and Illinois. The decline in absolute mobility is especially steep – from 95% for children born in 1940 to 41% for children born in 1984 – when we compare the sons’ earnings to their fathers’ earnings.

Why have rates of upward income mobility fallen so sharply over the past half-century? There have been two important trends that have affected the incomes of children born in the 1980s relative to those born in the 1940s and 1950s: lower Gross Domestic Product (GDP) growth rates and greater inequality in the distribution of growth. We find that most of the decline in absolute mobility is driven by the more unequal distribution of economic growth rather than the slowdown in aggregate growth rates. When we simulate an economy that restores GDP growth to the levels experienced in the 1940s and 1950s but distributes that growth across income groups as it is distributed today, absolute mobility only increases to 62%. In contrast, maintaining GDP at its current level but distributing it more broadly across income groups – as it was distributed for children born in the 1940s – would increase absolute mobility to 80%, thereby reversing more than two-thirds of the decline in absolute mobility.

These findings show that higher growth rates alone are insufficient to restore absolute mobility to the levels experienced in mid-century America. Under the current distribution of GDP, we would need real GDP growth rates above 6% per year to return to the rates of absolute mobility of the 1940s. Intuitively, because a large fraction of GDP goes to a small fraction of high-income households today, higher GDP growth does not substantially increase the number of children who earn more than their parents. Of course, this does not mean that GDP growth does not matter: changing the distribution of growth naturally has smaller effects on absolute mobility when there is very little growth to be distributed. The key point is that increasing absolute mobility substantially would require more broad-based economic growth.
We conclude that absolute mobility has declined sharply in America over the past half-century primarily because of the growth in inequality. If one wants to revive the “American Dream” of high rates of absolute mobility, one must have an interest in growth that is shared more broadly across the income distribution.
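As a worked toy illustration of the statistic itself, the sketch below computes absolute mobility, the share of children whose inflation-adjusted income at age 30 exceeds their parents' income at the same age. The numbers are illustrative and are not drawn from the replication data.

```python
# Toy absolute-mobility calculation; the incomes are illustrative only.
import numpy as np

parent_income = np.array([30_000, 45_000, 60_000, 80_000, 120_000])  # real dollars at age 30
child_income  = np.array([35_000, 40_000, 65_000, 70_000, 150_000])  # real dollars at age 30

# Fraction of children earning more than their parents.
absolute_mobility = np.mean(child_income > parent_income)
print(f"absolute mobility: {absolute_mobility:.0%}")  # 3 of 5 children -> 60%
```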
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1: Stata do-file to generate WIR and TWIR figures.
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview,
  title   = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author  = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal = {Working Notes of CLEF},
  year    = {2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Subtask 3: Multi-class fake news detection of news articles (English). This subtask is designed as a four-class classification problem. The training data will be released in batches, comprising roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of the article is a mixture of true and false information; it cannot be considered 100% true. This category includes all articles labelled partially false, partially true, mostly true, miscaptioned, misleading, etc., by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Input Data
The data will be provided in the format ID, title, text, rating, domain; the columns are described as follows:
Task 3
ID - Unique identifier of the news article
Title - Title of the news article
text - Text of the news article
our rating - Class of the news article: false, partially false, true, or other
Output data format
Task 3
public_id - Unique identifier of the news article
predicted_rating - Predicted class
Sample File
public_id, predicted_rating
1, false
2, true
Sample File
public_id, predicted_domain
1, health
2, crime
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
Fakenews Classification Datasets
Fake News Detection Challenge KDD 2020
FakeNewsNet
IMPORTANT!
We have used data from 2010 to 2021, and the fake news content spans several topics, such as elections and COVID-19.
Evaluation Metrics
This task is evaluated as a classification task. We will use the macro-averaged F1 measure (F1-macro) for ranking teams. There is a limit of 5 runs in total (not per day), and only one person from a team is allowed to submit runs.
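As a quick reference, the ranking metric can be computed with scikit-learn as in the sketch below; the labels are illustrative.

```python
# F1-macro over the four task classes, computed on illustrative labels.
from sklearn.metrics import f1_score

y_true = ["false", "true", "partially false", "other", "false"]
y_pred = ["false", "true", "partially false", "false", "false"]

# average="macro" weights each class equally, regardless of class frequency.
print("F1-macro:", f1_score(y_true, y_pred, average="macro"))
```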
Submission Link: Coming soon
Related Work
Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf
Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Females Aged 15 - 44 Years by Whether or Not They Have Had Children by Social Class, Aggregate Town or Rural Area, Census Year and Statistic
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserh/titanic-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1,502 out of 2,224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
This dataset has been referred from Kaggle: https://www.kaggle.com/c/titanic/data.
--- Original source retains full ownership of the source dataset ---
Please be advised that there are issues with the Small Area boundary dataset generalised to 20m which affect Small Area 268014010 in Ballygall D, Dublin City. The Small Area boundary dataset generalised to 20m is in the process of being revised, and the updated datasets will be available as soon as the boundaries are amended. This feature layer was created using Census 2016 data produced by the Central Statistics Office (CSO) and Small Areas national boundary data (generalised to 20m) produced by Tailte Éireann. The layer represents Census 2016 theme 9.1, population aged 15+ by sex and social class. Attributes include population breakdown by social class and sex (e.g. skilled manual - males, non-manual - females). Census 2016 theme 9 represents Social Class and Socio-Economic Group. The Census is carried out every five years by the CSO to produce an account of every person in Ireland. The results provide information on a range of themes, such as population, housing and education. The data were sourced from the CSO.

The Small Area Boundaries were created with the following credentials:
• National boundary dataset.
• Consistent sub-divisions of an ED (Electoral Division).
• Created not to cross some natural features.
• Defined areas with a minimum number of GeoDirectory building address points.
• Defined areas initially created with a minimum of 65 (on average around 90) residential address points.
• Generated using two bespoke algorithms which incorporated the ED and Townland boundaries, ortho-photography, large-scale vector data and GeoDirectory data.
• Before the 2011 census they were split in relation to motorways and dual carriageways; after the census some boundaries were merged and others divided to maintain the privacy of the residential area occupants.
• They are available as generalised and non-generalised boundary sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been constructed from the Scottish Young People's Survey (SYPS) and the Scottish School Leavers Survey (SSLS) as part of the research project Education and Youth Transitions in England, Wales and Scotland 1984-2002, funded by the Economic and Social Research Council (R000239852). A key part of the project was to create time-series datasets from the Youth Cohort Study (YCS) and the Scottish cohort surveys that are comparable over time and across England, Wales and Scotland. All the datasets constructed for this project are available from the UK Data Service, study number SN 5765. The Scottish youth cohort trends dataset included in Datashare includes one additional cohort.
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for researchers or practitioners to apply a topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a and 4-b. Note that, due to the dataset terms of use by Yelp and the restriction on data size, we instead provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get dendrograms of selected groups of variables as in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2 and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running the benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R. After determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package, then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
https://www.icpsr.umich.edu/web/ICPSR/studies/27804/terms
This special topic poll, fielded September 10, 2009, re-interviewed 648 adults first surveyed August 27-31, 2009. This continuing series of monthly surveys solicits public opinion on the presidency and on a range of other political and social issues. The dataset includes their responses to call-back questions as well as to selected questions in the original poll (ICPSR 27803), which asked whether they approved of the way Barack Obama was handling the presidency, the war in Afghanistan, health care, and the economy. Several questions addressed health care, including whether respondents thought the health care system in the United States worked well, whether Medicare worked well, and whether the government would do a better job than private health care companies in keeping health care costs down and providing medical coverage. Respondents were also asked whether President Obama's proposals for reform would increase competition in the private insurance market and the health insurance industry, whether they believed in the possibility of expanding health care coverage without increasing budget deficits or taxes on the middle class, whether President Obama or the Republicans in Congress had better ideas about reforming the health care system, and whether they understood the health care reforms that Congress was considering. Further questions asked whether the health care reform proposed by President Obama would make health care better in the United States and would help the respondent personally, and whether respondents favored the ideas of requiring all Americans to buy health insurance and of the government offering everyone a government-administered health insurance plan. Information was collected on how respondents thought the health care reforms under consideration in Congress would affect the middle class, senior citizens, small businesses, the respondent personally, their health care costs, and the quality of health care. Additional topics covered included the pullout of troops from Iraq, credit card debt, how the federal government should use taxpayers' money, personal finances, the best way to discourage obesity, terrorist attacks, the war in Afghanistan, the swine flu, and job security. Respondents were re-interviewed on September 10, 2009, and asked whether they approved of the way Barack Obama was handling health care, whether they had listened to the president's address of September 9th, how clearly he explained the reforms, whether they agreed with the proposed reforms, and whether Congress would pass and President Obama would sign a bill reforming the system. Questions regarding the budget deficit, expanded health care, and regulation of the health insurance industry were also asked. Demographic variables include sex, age, race, marital status, education level, household income, political party affiliation, political philosophy, perceived social class, religious preference, and voter registration status and participation history.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset pertains to the sinking of the RMS Titanic, one of the most infamous shipwrecks in history. On 15 April 1912, during its maiden voyage, the Titanic struck an iceberg and sank, leading to the deaths of 1,502 out of 2,224 passengers and crew due to an insufficient number of lifeboats. While luck played a role, certain groups of people demonstrated a higher likelihood of survival. The primary goal for users of this dataset is to construct a predictive model that identifies the types of individuals who were more likely to survive, utilising passenger details such as name, age, gender, and socio-economic class. Additionally, the objective involves understanding and preparing the dataset, building robust classification models, fine-tuning their hyperparameters, and comparing various algorithm evaluation metrics.
The dataset contains the following columns:
* PassengerId: A unique identifier for each passenger.
* Survived: Indicates whether the passenger survived (1) or not (0).
* Pclass: The passenger's ticket class (1st, 2nd, or 3rd class).
* Name: The full name of the passenger.
* Sex: The gender of the passenger (male or female).
* Age: The age of the passenger in years.
* SibSp: The number of siblings or spouses aboard the Titanic with the passenger.
* Parch: The number of parents or children aboard the Titanic with the passenger.
* Ticket: The ticket number.
* Fare: The passenger's fare.
* Cabin: The cabin number.
* Embarked: The port from which the passenger embarked (Cherbourg, Queenstown, or Southampton).
The dataset is provided as a CSV file named Titanic-Dataset.csv, with a size of 61.19 kB. It features 12 columns. Most columns contain 891 valid records, representing the total number of passengers. However, the 'Age' column has 177 missing values (20%), 'Cabin' has 687 missing values (77%), and 'Embarked' has 2 missing values.
This dataset is ideally suited for:
* Developing classification models to predict passenger survival (see the sketch below).
* Conducting data clean-up and exploratory data analysis.
* Experimenting with hyperparameter tuning for machine learning algorithms.
* Comparing the performance of various classification algorithms to determine the most effective predictive approach.
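A minimal Python sketch of such a survival model follows, using the documented columns of Titanic-Dataset.csv and imputing the missing Age values noted in the file description; the particular model and encoding choices are illustrative, not part of the dataset.

```python
# Minimal survival classifier on Titanic-Dataset.csv; model choices are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("Titanic-Dataset.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())   # ~20% of ages are missing
df["Sex"] = (df["Sex"] == "female").astype(int)    # encode gender as 0/1
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```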
The dataset covers passengers and crew involved in the RMS Titanic's maiden voyage on 15 April 1912. The demographic scope includes individuals across different ages, genders, socio-economic classes, and family structures. Geographic relevance is tied to the ports of embarkation: Cherbourg, Queenstown, and Southampton. It should be noted that there are significant gaps in data availability for passenger age (20% missing) and cabin numbers (77% missing).
This dataset is under a CC0: Public Domain license.
This dataset is highly valuable for:
* Machine Learning Engineers: To build, train, and evaluate predictive models.
* Data Scientists: For in-depth statistical analysis and feature engineering.
* Students and Beginners in Data Science: It is classified as a "Beginner" dataset, making it an excellent resource for learning classification tasks and data pre-processing.
* Researchers: Interested in historical data analysis and factors influencing survival in disaster scenarios.
Original Data Source: Titanic Survival Prediction Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**About this Data:** Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. An extensive dataset containing emoticons, emojis, hashtags, slang, and contractions is required to detect hate speech on social media based on current trends. This dataset contains hate speech sentences in English, divided into two classes: one representing hateful content and the other representing non-hateful content.
Specifications table

| Field | Value |
|---|---|
| Subject | Natural Language Processing (NLP) |
| Specific subject area | A curated dataset comprising emojis, emoticons, and contractions bundled into two classes, hateful and non-hateful, to detect hate speech in text. |
| Type of data | Text |
| Data format | Annotated, analysed, filtered data |
| Data article | A curated dataset for hate speech detection on social media text |
| Data source location | https://data.mendeley.com/datasets/9sxpkmm8xn/1 |
**Value of this Data:**
1. This dataset is useful for training machine learning models to identify hate speech in social media text. It reflects current social media trends and the modern ways of writing hateful text using emojis, emoticons, or slang. It will help social media managers, administrators, or companies develop automatic systems that filter out hateful content on social media by categorizing a text as hateful or non-hateful speech (a minimal classifier sketch follows this list).
2. Deep Learning (DL) and Natural Language Processing (NLP) practitioners are the target beneficiaries, as this dataset can be used for detecting hateful speech with DL and NLP techniques. The samples are composed of text sentences and labels belonging to two categories: "0" for non-hateful and "1" for hateful.
3. Additionally, this dataset can be used as a benchmark dataset for hate speech detection.
4. The dataset is neutralized so that it can be used by anyone: it does not include any entities or names that could cause harm, including cyber harm, to the users who generated the content. Researchers can take advantage of the pre-processed dataset for their projects, as it maintains and follows the policy guidelines.
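A minimal Python sketch of such a binary hate-speech classifier, assuming the dataset is exported as a CSV with a text column and a 0/1 label column (the file and column names are hypothetical):

```python
# Binary hate-speech classification sketch; "hate_speech.csv", "text" and
# "label" are hypothetical names for the dataset export and its columns.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("hate_speech.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=1)

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)  # word unigrams and bigrams
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
print(classification_report(y_test, clf.predict(Xte)))  # per-class metrics: 0 vs. 1
```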