53 datasets found

S
Predictive data analysis techniques for higher education students dropout
scidb.cn
Updated Apr 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cindy (2023). Predictive data analysis techniques for higher education students dropout [Dataset]. http://doi.org/10.57760/sciencedb.07894
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.07894
Dataset updated
Apr 10, 2023
Dataset provided by
Science Data Bank
Authors
Cindy
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
In this research, we have generated student retention alerts. The alerts are classified into two types: preventive and corrective. This classification varies according to the level of maturity of the data systematization process. Therefore, to systematize the data, data mining techniques have been applied. The experimental analytical method has been used, with a population of 13,715 students with 62 sociological, academic, family, personal, economic, psychological, and institutional variables, and factors such as academic follow-up and performance, financial situation, and personal information. In particular, information is collected on each of the problems or a combination of problems that could affect dropout rates. Following the methodology, the information has been generated through an abstract data model to reflect the profile of the dropout student. As advancement from previous research, this proposal will create preventive and corrective alternatives to avoid dropout higher education. Also, in contrast to previous work, we generated corrective warnings with the application of data mining techniques such as neural networks until reaching a precision of 97% and losses of 0.1052. In conclusion, this study pretends to analyze the behavior of students who drop out the university through the evaluation of predictive patterns. The overall objective is to predict the profile of student dropout, considering reasons such as admission to higher education and career changes. Consequently, using a data systematization process promotes the permanence of students in higher education. Once the profile of the dropout has been identified, student retention strategies have been approached, according to the time of its appearance and the point of view of the institution.
E
Exploratory Data Analysis (EDA) Tools Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54164
Explore at:
ppt, pdf, docAvailable download formats
Dataset updated
Apr 2, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating a significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are seamlessly integrating into EDA tools, enhancing their capabilities and broadening their appeal. The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data handling needs. However, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements. Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs. Additionally, the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.
f
A tabulation of features used in this study.
plos.figshare.com
xls
Updated Jun 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Leonard Tan; Ooi Kiang Tan; Chun Chau Sze; Wilson Wen Bin Goh (2023). A tabulation of features used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0274299.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0274299.t002
Dataset updated
Jun 7, 2023
Dataset provided by
PLOS ONE
Authors
Leonard Tan; Ooi Kiang Tan; Chun Chau Sze; Wilson Wen Bin Goh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A tabulation of features used in this study.
Companion for the article "Profiling a task-based molecular dynamics...
zenodo.org
bin, zip
Updated Jun 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christian Asch Burgos; Christian Asch Burgos; Lucas Mello Schnorr; Lucas Mello Schnorr; Esteban Meneses; Esteban Meneses (2025). Companion for the article "Profiling a task-based molecular dynamics application with a data science approach" [Dataset]. http://doi.org/10.5281/zenodo.15612154
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15612154
Dataset updated
Jun 7, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Christian Asch Burgos; Christian Asch Burgos; Lucas Mello Schnorr; Lucas Mello Schnorr; Esteban Meneses; Esteban Meneses
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the datasets, analysis scripts, and visualization outputs accompanying the paper "Profiling a Task-Based Molecular Dynamics Application with a Data Science Approach", submitted to CARLA 2025. It includes processed performance traces of LeanMD simulations using Charm++, along with the full analysis pipeline implemented using the CharmVZ tool (sources provided as a snapshot of https://github.com/caschb/charmvz). The code is organized into modular components for trace parsing, CSV and Parquet generation, and plot creation using data science libraries. One computational notebook is provided to reproduce the figures from the paper. This release promotes transparency and reproducibility, and supports the broader adoption of open data and software practices in HPC performance analysis.
d
The Globalization of Personal Data (GPD) Project International Survey on...
dataone.org
borealisdata.ca
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Surveillance Studies Centre (2023). The Globalization of Personal Data (GPD) Project International Survey on Privacy and Surveillance [Dataset]. https://dataone.org/datasets/sha256%3A40f901a6a34637e53687870f937343ca6be1cd5b8243a3503d3f31ea29fedc5b
Explore at:
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Surveillance Studies Centre
Time period covered
Jan 1, 2006 - Jan 1, 2007
Description
The Globalization of Personal Data (GPD) was an international, multi-disciplinary and collaborative research initiative drawing mainly on the social sciences but also including information, computing, technology studies, and law, that explored the implications of processing personal and population data in electronic format from 2004 to 2008. Such data included everything from census statistics to surveillance camera images, from biometric passports to supermarket loyalty cards. The project ma intained a strong concern for ethics, politics and policy development around personal data. The project, funded by the Social Sciences and Humanities Research Council of Canada (SSHRCC) under its Initiative on the New Economy program, conducted research on why surveillance occurs, how it operates, and what this means for people's everyday lives (See http://www.sscqueens.org/projects/gpd). The unique aspect of the GPD included a major international survey on citizens' attitudes to issues of surveillance and privacy. The GPD project was conducted in nine countries: Canada, U.S.A., France, Spain, Hungary, Mexico, Brazil, China, and Japan. Three data files were produced: a Seven-Country file (Canada, U.S.A., France, Spain, Hungary, Mexico, and Brazil), a China file, and a Japan file. Country Report are available for download from QSpace (Queen's University Research and Learning Repository).
p
Trends in Diversity Score (2022-2023): Newark Sch Of Data Science And...
publicschoolreview.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public School Review, Trends in Diversity Score (2022-2023): Newark Sch Of Data Science And Information Technology vs. New Jersey vs. Newark Public School District [Dataset]. https://www.publicschoolreview.com/newark-sch-of-data-science-and-information-technology-profile
Explore at:
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Newark, Newark City School District, New Jersey
Description
This dataset tracks annual diversity score from 2022 to 2023 for Newark Sch Of Data Science And Information Technology vs. New Jersey and Newark Public School District
p
Trends in Black Student Percentage (2022-2023): Newark Sch Of Data Science...
publicschoolreview.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public School Review, Trends in Black Student Percentage (2022-2023): Newark Sch Of Data Science And Information Technology vs. New Jersey vs. Newark Public School District [Dataset]. https://www.publicschoolreview.com/newark-sch-of-data-science-and-information-technology-profile
Explore at:
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Newark, Newark City School District, New Jersey
Description
This dataset tracks annual black student percentage from 2022 to 2023 for Newark Sch Of Data Science And Information Technology vs. New Jersey and Newark Public School District
Water Column Profile Data
fisheries.noaa.gov
s.cnmilf.com
+1more
Updated Jan 1, 1982
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Southeast Fisheries Science Center (1982). Water Column Profile Data [Dataset]. https://www.fisheries.noaa.gov/inport/item/30058
Explore at:
Dataset updated
Jan 1, 1982
Dataset provided by
Southeast Fisheries Science Center
Time period covered
1992 - Jul 13, 2125
Area covered

Description
The Southeast Fisheries Science Center Mississippi Laboratories conducts standardized fisheries independent resource surveys in the Gulf of Mexico, South Atlantic, and U.S. Caribbean to provide abundance and distribution information to support regional and international stock assessments. Environmental profiles are acquired during all surveys and are averaged into one meter depth bins. The data...
E
Exploratory Data Analysis (EDA) Tools Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54369
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
Apr 2, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033. This growth is segmented across various applications, with large enterprises leading the adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations facilitating better data understanding and communication of findings. Geographic regions, such as North America and Europe, currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to effectively utilize these tools and the potential for data bias if not handled appropriately. The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. Open-source options like KNIME and R packages (Rattle, Pandas Profiling) offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses. The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
c
LifeScience Data Mining And Visualization Market size was USD 5815.2 million...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Apr 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cognitive Market Research (2025). LifeScience Data Mining And Visualization Market size was USD 5815.2 million in 2023! [Dataset]. https://www.cognitivemarketresearch.com/lifescience-data-mining-and-visualization-market-report
Explore at:
pdf,excel,csv,pptAvailable download formats
Dataset updated
Apr 25, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global Lifescience Data Mining And Visualization market size is USD 5815.2 million in 2023 and will expand at a compound annual growth rate (CAGR) of 9.60% from 2023 to 2030.

North America held the major market of more than 40% of the global revenue with a market size of USD 2326.08 million in 2023 and will grow at a compound annual growth rate (CAGR) of 7.8% from 2023 to 2030 Europe held the major market of more than 40% of the global revenue with a market size of USD 1744.56 million in 2023 and will grow at a compound annual growth rate (CAGR) of 8.1% from 2023 to 2030. Asia Pacific held the fastest growing market of more than 23% of the global revenue with a market size of USD 1337.50 million in 2023 and will grow at a compound annual growth rate (CAGR) of 11.6% from 2023 to 2030 Latin America market held of more than 5% of the global revenue with a market size of USD 290.76 million in 2023 and will grow at a compound annual growth rate (CAGR) of 9.0% from 2023 to 2030 Middle East and Africa market held of more than 2.00% of the global revenue with a market size of USD 116.30 million in 2023 and will grow at a compound annual growth rate (CAGR) of 9.3% from 2023 to 2030 The demand for Lifescience Data Mining And Visualizations is rising due to rapid growth in biological data and increasing emphasis on personalized medicine. Demand for On-Demand remains higher in the Lifescience Data Mining And Visualization market. The Pharmaceuticals category held the highest Lifescience Data Mining And Visualization market revenue share in 2023.

Market Dynamics of Lifescience Data Mining And Visualization

Key Drivers of Lifescience Data Mining And Visualization

Advancements in Healthcare Informatics to Provide Viable Market Output

The Lifescience Data Mining and Visualization market are driven by continuous advancements in healthcare informatics. As the life sciences industry generates vast volumes of complex data, sophisticated data mining and visualization tools are increasingly crucial. Advancements in healthcare informatics, including electronic health records (EHRs), genomics, and clinical trial data, provide a wealth of information. Data mining and visualization technologies empower researchers and healthcare professionals to extract meaningful insights, aiding in personalized medicine, drug discovery, and treatment optimization.

August 2020: Johnson & Johnson and Regeneron Pharmaceuticals announced a strategic collaboration to develop and commercialize cancer immunotherapies.

(Source:investor.regeneron.com/news-releases/news-release-details/regeneron-and-cytomx-announce-strategic-research-collaboration)

Rising Focus on Precision Medicine Propel Market Growth

A key driver in the Lifescience Data Mining and Visualization market is the growing focus on precision medicine. As healthcare shifts towards personalized treatment strategies, there is an increasing need to analyze diverse datasets, including genetic, clinical, and lifestyle information. Data mining and visualization tools facilitate the identification of patterns and correlations within this multidimensional data, enabling the development of tailored treatment approaches. The emphasis on precision medicine, driven by advancements in genomics and molecular profiling, positions data mining and visualization as essential components in deciphering the intricate relationships between biological factors and individual health, thereby fostering innovation in life science research and healthcare practices.

In June 2022, SAS Institute Inc. (US) entered into an agreement with Gunvatta (US) to expedite clinical trials and FDA reporting through the SAS Life Science Analytics Framework on Azure.

(Source:www.prnewswire.com/news-releases/clinical-research-and-drug-development-accelerated-via-analytics-301571580.html)

Increasing adoption of artificial intelligence (AI) and machine learning (ML) algorithms is propelling the market growth of life science data mining and visualization

These technologies have revolutionized the ability to analyze and interpret vast, complex datasets in fields such as drug discovery and personalized medicine. For instance, companies like Insitro are utilizing AI-driven models to analyze biological and chemical data, dramatically accelerating drug discovery timelines and optimizing the identification of new therape...
Mining mineral commodity trade llc USA Import & Buyer Data
seair.co.in
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim, Mining mineral commodity trade llc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Mining rock excavation and construc 3700 east 68th USA Import & Buyer Data
seair.co.in
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim, Mining rock excavation and construc 3700 east 68th USA Import & Buyer Data [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Mining rock excavation and constru USA Import & Buyer Data
seair.co.in
Updated Oct 23, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim (2016). Mining rock excavation and constru USA Import & Buyer Data [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset updated
Oct 23, 2016
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
p
Trends in White Student Percentage (2022-2023): Newark Sch Of Data Science...
publicschoolreview.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public School Review, Trends in White Student Percentage (2022-2023): Newark Sch Of Data Science And Information Technology vs. New Jersey vs. Newark Public School District [Dataset]. https://www.publicschoolreview.com/newark-sch-of-data-science-and-information-technology-profile
Explore at:
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Newark, Newark City School District, New Jersey
Description
This dataset tracks annual white student percentage from 2022 to 2023 for Newark Sch Of Data Science And Information Technology vs. New Jersey and Newark Public School District
Z
Frictionless Tabular Data Package for GC-MS Rose scent profile data for Data...
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Susanna Assunta Sansone (2020). Frictionless Tabular Data Package for GC-MS Rose scent profile data for Data published in Nature genetics, June, 2018 & Science, July 2015 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2621541
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Philippe Rocca-Serra
Susanna Assunta Sansone
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset, in the form of a Frictionless Tabular Data Package (https://frictionlessdata.io/specs/tabular-data-package/), holds the measurements of 61 known metabolites (all annotated with resolvable CHEBI identifiers and InChi strings), measured by gas chromatography mass-spectrometry (GC-MS) in 6 different Rose cultivars (all annotated with resolvable NCBITaxonomy Identifiers) and 3 organism parts (all annotated with resolvable Plant Ontology identifiers). The quantitation types are annotated with resolvable STATO terms.

The data were extracted from:

a supplementary material table, available from https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-018-0110-3/MediaObjects/41588_2018_110_MOESM3_ESM.zip and published alongside the Nature Genetics manuscript identified by the following doi: https://doi.org/10.1038/s41588-018-0110-3, published in June 2018

a supplementary material table available as a pdf from "Biosynthesis of monoterpene scent compounds in roses" by Magnard et al, Science 03 Jul 2015 identified by the following doi: https://doi.org/10.1126/science.aab0696

This dataset is used to demonstrate how to make data Findable, Accessible, Discoverable and Interoperable (FAIR) and how Frictionless Tabular Data Package representations can be easily mobilised for reanalysis and data science.

It is associated to the following project: https://github.com/proccaserra/rose2018ng-notebook with all the necessary information, executable code and tutorials in the form of Jupyter notebooks.
Data from: Analysis of spatiotemporal specificity of small RNAs regulating...
figshare.com
xlsx
Updated Sep 29, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lu Li (2019). Analysis of spatiotemporal specificity of small RNAs regulating hPSC differentiation and beyond [Dataset]. http://doi.org/10.6084/m9.figshare.9911918.v2
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.9911918.v2
Dataset updated
Sep 29, 2019
Dataset provided by
Figsharehttp://figshare.com/
Authors
Lu Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a quantitative analysis of small RNA dynamics during the transition from hPSCs to the three germ layer lineages to identify spatiotemporal-specific small RNAs that may be involved in hPSC differentiation. To determine the degree of spatiotemporal specificity, we utilized two algorithms, namely normalized maximum timepoint specificity index (NMTSI) and across-tissue specificity index (ASI). NMTSI could identify spatiotemporal-specific small RNAs that go up or down at just one timepoint in a specific lineage. ASI could identify spatiotemporal-specific small RNAs that maintain high expression from intermediate timepoints to the terminal timepoint in a specific lineage. Beyond analyzing single small RNAs, we also quantified the spatiotemporal-specificity of microRNA families and observed their differential expression patterns in certain lineages. To clarify the regulatory effects of group miRNAs on cellular events during lineage differentiation, we performed a gene ontology (GO) analysis on the downstream targets of synergistically up- and downregulated microRNAs. To provide an integrated interface for researchers to access and browse our analysis results, we designed a web-based tool at https://keyminer.pythonanywhere.com/km/.
Profiling Fake News Spreaders on Twitter
zenodo.org
Updated Sep 20, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU; FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU (2020). Profiling Fake News Spreaders on Twitter [Dataset]. http://doi.org/10.5281/zenodo.3692319
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3692319
Dataset updated
Sep 20, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU; FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU
Description
Task

Fake news has become one of the main threats of our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for their fast propagation. A great amount of fake news, and rumors are propagated in online social networks with the aim, usually, to deceive users and formulate specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information either intentionally or unintentionally. To this end, in this task, we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.

After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection, age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective), this year we aim at investigating if it is possbile to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done it.

As in previous years, we propose the task from a multilingual perspective:

English

Spanish

NOTE: Although we recommend to participate in both languages (English and Spanish), it is possible to address the problem just for one language.

Data

Input

The uncompressed dataset consists in a folder per language (en, es). Each folder contains:

A XML file per author (Twitter user) with 100 tweets. The name of the XML file correspond to the unique author id.

A truth.txt file with the list of authors and the ground truth.

The format of the XML files is:

The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.

b2d5748083d6fdffec6c2d68d4d4442d:::0 2bed15d46872169dc7deaf8d2b43a56:::0 8234ac5cca1aed3f9029277b2cb851b:::1 5ccd228e21485568016b4ee82deb0d28:::0 60d068f9cafb656431e62a6542de2dc0:::1 ...

Output

Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.

IMPORTANT! Languages should not be mixed. A folder should be created for each language and place inside only the files with the prediction for this language.

Evaluation

The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.

Submission

Once you finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:

mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY

Within OUTPUT-DIRECTORY, we require two subfolders: en and es, one folder per language, respectively. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:

The naming of the output files is up to you. However, we recommend to use the author-id as filename and "xml" as extension.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

Bilal Ghanem, Paolo Rosso, Francisco Rangel. An Emotional Analysis of False Information in Social Media and News Articles. arXiv preprint arXiv:1908.09951 (2019). ACM Transactions on Internet Technology (TOIT). In Press.

Anastasia Giachanou, Paolo Rosso, Fabio Crestani. Leveraging Emotional Signals for Credibility Detection. Proceedings of the 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR). pp 877–880. (2019)

Andre Guess, Jonathan Nagler, and Joshua Tucker. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances vol. 5 (2019)

Andrew Hall, Loren Terveen, Aaron Halfaker. Bot Detection in Wikidata Using Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer Interaction. 2018 Nov 1;2(CSCW):64.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, Gerhard Weikum. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp 22-32. (2018)

Francisco Rangel and Paolo Rosso. Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter. In: L. Cappellato, N. Ferro, D. E. Losada and H. Müller (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings.CEUR-WS.org, vol. 2380

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 2125.

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L, Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866.

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Pottast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Capellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784

Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Pottast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015.In: Linda Cappelato and Nicola Ferro and Gareth Jones and Eric San Juan (Eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/,2015.

Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-827.

Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstatios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.)Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179

Francisco Rangel and Paolo Rosso On the Implications of the General Data Protection Regulation on the Organisation of Evaluation Tasks. In: Language and Law / Linguagem e Direito, Vol. 5(2), pp. 80-102

Kai Shu, Suhang Wang, and Huan Liu. Understanding user profiles on social media for fake news detection. Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430--435 (2018)

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter. (2017)
Score distributions of student journals.
plos.figshare.com
xls
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Leonard Tan; Ooi Kiang Tan; Chun Chau Sze; Wilson Wen Bin Goh (2023). Score distributions of student journals. [Dataset]. http://doi.org/10.1371/journal.pone.0274299.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0274299.t001
Dataset updated
Jun 21, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Leonard Tan; Ooi Kiang Tan; Chun Chau Sze; Wilson Wen Bin Goh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Score distributions of student journals.
p
Distribution of Students Across Grade Levels in Newark Sch Of Data Science...
publicschoolreview.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public School Review, Distribution of Students Across Grade Levels in Newark Sch Of Data Science And Information Technology [Dataset]. https://www.publicschoolreview.com/newark-sch-of-data-science-and-information-technology-profile
Explore at:
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Newark
Description
This dataset tracks annual distribution of students across grade levels in Newark Sch Of Data Science And Information Technology
d
Public Expression Profiling Resource
dknet.org
scicrunch.org
+2more
Updated Jan 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Public Expression Profiling Resource [Dataset]. http://identifiers.org/RRID:SCR_007274/resolver
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007274 https://identifiers.org/RRID:SCR_007274/resolver
Dataset updated
Jan 29, 2022
Description
An experiment in web-database access to large multi-dimensional data sets using a standardized experimental platform to determine if the larger scientific community can be given simple, intuitive, and user-friendly web-based access to large microarray data sets. All data in PEPR is also available via NCBI GEO. The structure and goals of PEPR differ from other mRNA expression profiling databases in a number of important ways. * The experimental platform in PEPR is standardized, and is an Affymetrix - only database. All microarrays available in the PEPR web database should ascribe to quality control and standard operating procedures. A recent publication has described the QC/SOP criteria utilized in PEPR profiles ( The Tumor Analysis Best Practices Working Group 2004 ). * PEPR permits gene-based queries of large Affymetrix array data sets without any specialized software. For example, a number of large time series projects are available within PEPR, containing 40-60 microarrays, yet these can be simply queried via a dynamic web interface with no prior knowledge of microarray data analysis. * Projects in PEPR originate from scientists world-wide, but all data has been generated by the Research Center for Genetic Medicine, Children''''s National Medical Center, Washington DC. Future developments of PEPR will allow remote entry of Affymetrix data ascribing to the same QC/SOP protocols. They have previously described an initial implementation of PEPR, and a dynamic web-queried time series graphical interface ( Chen et al. 2004 ). A publication showing the utility of PEPR for pharmacodynamic data has recently been published ( Almon et al. 2003 ).

Facebook

Twitter

Click to copy link

Link copied

Cite

Cindy (2023). Predictive data analysis techniques for higher education students dropout [Dataset]. http://doi.org/10.57760/sciencedb.07894

Predictive data analysis techniques for higher education students dropout

Explore at:

250 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.57760/sciencedb.07894

Dataset updated

Apr 10, 2023

Dataset provided by

Science Data Bank

Authors

Cindy

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

In this research, we have generated student retention alerts. The alerts are classified into two types: preventive and corrective. This classification varies according to the level of maturity of the data systematization process. Therefore, to systematize the data, data mining techniques have been applied. The experimental analytical method has been used, with a population of 13,715 students with 62 sociological, academic, family, personal, economic, psychological, and institutional variables, and factors such as academic follow-up and performance, financial situation, and personal information. In particular, information is collected on each of the problems or a combination of problems that could affect dropout rates. Following the methodology, the information has been generated through an abstract data model to reflect the profile of the dropout student. As advancement from previous research, this proposal will create preventive and corrective alternatives to avoid dropout higher education. Also, in contrast to previous work, we generated corrective warnings with the application of data mining techniques such as neural networks until reaching a precision of 97% and losses of 0.1052. In conclusion, this study pretends to analyze the behavior of students who drop out the university through the evaluation of predictive patterns. The overall objective is to predict the profile of student dropout, considering reasons such as admission to higher education and career changes. Consequently, using a data systematization process promotes the permanence of students in higher education. Once the profile of the dropout has been identified, student retention strategies have been approached, according to the time of its appearance and the point of view of the institution.

Clear search

Close search

Google apps

Main menu

Predictive data analysis techniques for higher education students dropout

Exploratory Data Analysis (EDA) Tools Report

A tabulation of features used in this study.

Companion for the article "Profiling a task-based molecular dynamics...

The Globalization of Personal Data (GPD) Project International Survey on...

Trends in Diversity Score (2022-2023): Newark Sch Of Data Science And...

Trends in Black Student Percentage (2022-2023): Newark Sch Of Data Science...

Water Column Profile Data

Exploratory Data Analysis (EDA) Tools Report

LifeScience Data Mining And Visualization Market size was USD 5815.2 million...

Mining mineral commodity trade llc USA Import & Buyer Data

Mining rock excavation and construc 3700 east 68th USA Import & Buyer Data

Mining rock excavation and constru USA Import & Buyer Data

Trends in White Student Percentage (2022-2023): Newark Sch Of Data Science...

Frictionless Tabular Data Package for GC-MS Rose scent profile data for Data...

Data from: Analysis of spatiotemporal specificity of small RNAs regulating...

Profiling Fake News Spreaders on Twitter

Score distributions of student journals.

Distribution of Students Across Grade Levels in Newark Sch Of Data Science...

Public Expression Profiling Resource

Predictive data analysis techniques for higher education students dropout