License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Cancer, the second-leading cause of mortality, accounts for roughly 16% of deaths worldwide. Unhealthy lifestyles, smoking, alcohol abuse, obesity, and lack of exercise have been linked to cancer incidence and mortality, but the relationships are hard to quantify. Correlating cancer with lifestyle factors and predicting incidence and mortality over the next several years can guide healthier living and help target medical financial resources. This paper has two key research areas. The first is data preprocessing and sample expansion: using experimental analysis and comparison, the study selects cubic spline interpolation to expand the original data from 32 entry points to 420, converting annual data into monthly data and so resolving the shortage of data for correlation analysis and prediction; factor analysis is possible because the data sources record the changing factors. The second is a two-stage attention design, TSA-LSTM. Tableau, a popular tool with advanced visualization functions, simplifies the exploratory part of the study, but testing showed that Tableau alone cannot analyze and predict this paper's time-series data. The TSA-LSTM optimization model is therefore built on an LSTM. Its first stage, input feature attention, ensures that the encoder converges on the relevant subset of input-sequence features when predicting the output sequence, improving the model's learning behavior and prediction quality. Its second stage, temporal attention, selects network features based on real-time performance to further improve forecasts. Validating the data sources through factor correlation analysis and trend prediction with the TSA-LSTM model shows that most cancers have overlapping risk factors: excessive drinking, lack of exercise, and obesity can contribute to breast, colorectal, and colon cancer, and visual tests indicate that a poor lifestyle directly promotes lung, laryngeal, and oral cancers.
Based on 2021 data, cancer incidence is expected to climb 18–21% between 2020 and 2025. Long-term projection accuracy is 98.96%, and smoking and obesity may be the main causes of cancer.
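The sample-expansion step described above, expanding an annual series into a monthly one with cubic spline interpolation, can be sketched as follows. The grid sizes and incidence values here are illustrative toys, not the paper's actual data:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def annual_to_monthly(years, values):
    """Fit a cubic spline through annual observations and resample it
    at monthly resolution (12 points per year)."""
    cs = CubicSpline(years, values)
    # One point per month across the full annual range, endpoints included
    months = np.linspace(years[0], years[-1], num=(len(years) - 1) * 12 + 1)
    return months, cs(months)

# Toy example: 5 annual data points expanded into a 49-point monthly series
years = np.array([2016.0, 2017.0, 2018.0, 2019.0, 2020.0])
incidence = np.array([100.0, 104.0, 110.0, 118.0, 128.0])
months, monthly = annual_to_monthly(years, incidence)
```

The spline passes through every annual observation, so the expanded series agrees with the original data at the year boundaries while filling in smooth monthly values between them.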
License: CC0 1.0 Universal (Public Domain), https://creativecommons.org/publicdomain/zero/1.0/
Case Study: How Can a Wellness Technology Company Play It Smart?
This is my first case study as a data analyst, using Excel, Tableau, and R, and it is part of my Google Data Analytics Professional Certification. Some insights may be presented differently from a reader's point of view, or some may not be covered; feedback is appreciated.
Scenario:
The Bellabeat data analysis case study! In this case study, you perform the real-world tasks of a junior data analyst. Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat's products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide the company's marketing strategy, and you will present your analysis to the Bellabeat executive team along with high-level recommendations for Bellabeat's marketing strategy.
The case study follows this roadmap: to answer the key business questions, you follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
Ask:
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.
These questions will guide your analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat's marketing strategy?
To produce a report with the following deliverables:
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis
Prepare: includes Dataset used, Accessibility and privacy of data, Information about our dataset, Data organization and verification, Data credibility and integrity.
The dataset used for analysis is from Kaggle, which is considered a reliable source. Sršen encourages the use of public data that explores smart device users' daily habits and points to a specific dataset: Fitbit Fitness Tracker Data (CC0: Public Domain, made available through Mobius). This Kaggle dataset contains personal fitness tracker data from thirty eligible Fitbit users who consented to the submission of their data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits. Sršen notes that this dataset might have some limitations and encourages adding other data to help address them. This analysis, however, is confined to the present dataset; no additional data has been added to address its limitations. I may collect additional datasets later, subject to their availability, since the datasets companies provide may only be available on a subscription basis, or I would need to search for and access similar product datasets. That is why this analysis is limited to this dataset.
Process Phase:
1. Tools used for Analysis: Excel, Tableau, R studio, Kaggle
2. Cleaning of data: duplicate records were removed. Note that the data by its nature contains repeated Ids and dates, and it also contains zero values. These zeros may be inherent to how the devices record (human bodies and behavior are complex) or may have some other cause yet to be identified. The analysis was done on the data as available, although in a live project such anomalies would be discussed with someone who knows the data before proceeding.
3. Analysis was done based on available variables.
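The duplicate-removal and zero-value handling described in the Process Phase can be sketched in pandas. The column names follow the Kaggle Fitbit daily-activity table, but the rows below are invented for illustration:

```python
import pandas as pd

# Hypothetical slice of the Fitbit daily-activity table: one exact
# duplicate row and one zero-step day (values are made up)
daily = pd.DataFrame({
    "Id": [1503960366, 1503960366, 1624580081],
    "ActivityDate": ["4/12/2016", "4/12/2016", "4/12/2016"],
    "TotalSteps": [13162, 13162, 0],
})

# 1. Remove exact duplicate rows (same Id and date recorded twice)
daily = daily.drop_duplicates()

# 2. Flag, rather than delete, zero-value days: zeros may be genuine
#    (device not worn), so they are kept for inspection
zero_days = (daily["TotalSteps"] == 0).sum()
```

Keeping the zero-value days and only counting them mirrors the decision above: without a stakeholder to consult, the safer move is to document the anomaly rather than drop it.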
Analyze Phase:
| Id | Avg. VeryActiveDistance | Avg. ModerateActiveDistance | Avg. LightActiveDistance | TotalDistance | Avg. Calories |
|---|---|---|---|---|---|
| 1927972279 | 0.09580645 | 0.031290323 | 0.050709677 | … | … |
| 2026352035 | 0.006129032 | 0.011290322 | 3.43612904 | … | … |
| 3977333714 | 1.614999982 | 2.75099979 | 3.134333344 | … | … |
| 8053475328 | 8.514838742 | 0.423870965 | 2.533870955 | … | … |
| 8877689391 | 6.637419362 | 0.337741935 | 6.188709674 | 3420.258065 | 409.5… |
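Per-user averages like those shown can be produced with a groupby-and-mean in pandas; a minimal sketch with invented activity rows:

```python
import pandas as pd

# Illustrative rows shaped like the dailyActivity table (values made up)
activity = pd.DataFrame({
    "Id": [1927972279, 1927972279, 2026352035, 2026352035],
    "VeryActiveDistance": [0.1, 0.09, 0.0, 0.01],
    "LightActiveDistance": [0.05, 0.05, 3.4, 3.5],
    "Calories": [2100, 2000, 1500, 1550],
})

# Per-user averages, one row per Id: the shape of the table above
avg_by_user = activity.groupby("Id").mean()
```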
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This project presents a comprehensive analysis of global electricity production by various sources—coal, gas, nuclear, hydro, oil, solar, wind, bioenergy, and other renewables—across different countries and regions. The dataset, compiled from reliable international energy sources, has been cleaned and structured to support multi-platform exploration.
The analysis is carried out using Python (for data preprocessing and visualization), Microsoft Excel (for pivot tables and charts), Tableau (for interactive dashboards), and Power BI (for dynamic reporting). Each tool complements the others by offering diverse perspectives on electricity production patterns. From geographic visualizations to trend analysis, this multi-tool project highlights energy source dominance, regional disparities, and the pace of renewable adoption worldwide—contributing to informed discussions on energy transition and sustainability.
Columns description:
The dataset contains country-wise electricity production data categorized by energy source. The ‘Country’ column lists the country or region (e.g., ASEAN, G20, OECD), while the ‘Code’ column includes country codes (though often left blank). The ‘Year’ column specifies the year of each data entry. Energy production is measured in terawatt-hours (TWh) across multiple sources: coal, gas, and oil represent fossil fuels; nuclear captures electricity from atomic energy; and hydro, wind, solar, and bioenergy represent renewables.
An additional column, ‘Other renewables excluding bioenergy’, covers sources like geothermal and less common renewables. Together, these columns provide a comprehensive overview of each country's electricity production profile across different technologies and timelines.
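From the columns described, a renewable share per country-year falls out directly. A hedged sketch, assuming simplified column names and omitting the Nuclear, Oil, and Other renewables columns for brevity (all figures invented):

```python
import pandas as pd

# Hypothetical rows in the described schema: Country, Year, and
# per-source electricity production in TWh
prod = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2020, 2021],
    "Coal": [100.0, 90.0],
    "Gas": [50.0, 50.0],
    "Hydro": [30.0, 35.0],
    "Wind": [15.0, 20.0],
    "Solar": [5.0, 10.0],
    "Bioenergy": [5.0, 5.0],
})

renewables = ["Hydro", "Wind", "Solar", "Bioenergy"]
prod["Total"] = prod[["Coal", "Gas"] + renewables].sum(axis=1)
prod["RenewableShare"] = prod[renewables].sum(axis=1) / prod["Total"]
```

Tracking this share by year is one way to quantify the "pace of renewable adoption" the project highlights.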
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The population of Metro Vancouver (20110729 Regional Growth Strategy Projections: Population, Housing and Employment 2006–2041 file) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015 Water Consumption Statistics file) will be essential. Drinking water supply therefore needs to be estimated and optimized (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. A linear programming (LP) optimization model (Optimization, Sensitivity Report file) determines the amount of drinking water drawn from each reservoir for each region. The B.C. government has a specific strategy for the growing population up to 2040 that guides it toward this goal. In addition, a new source of drinking water (wells) needs to be estimated and monitored to anticipate feasible supply until 2040, so the government will have to decide how much groundwater is used. The project has two goals: (1) build an optimization model for the three water reservoirs, and (2) estimate the new source of water to 2040. The data is analyzed with six tools (Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, SQL, and Tableau):
1. Trifacta Wrangler cleans the data (Data Mining file).
2. AMPL and Excel Solver optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap combines the raw data, the reservoir optimization results, and the population estimates to 2040 in ArcGIS (GIS Map for Tableau file).
4. The source of drinking water for Metro Vancouver until 2040 is visualized, estimated, and optimized with SQL in Tableau (export Tableau data file).
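An LP of the kind described, allocating supply across the three reservoirs, can be sketched with scipy.optimize.linprog. The costs, capacities, and demand below are hypothetical placeholders, not the project's actual figures:

```python
from scipy.optimize import linprog

# Toy LP in the spirit of the reservoir model: minimize total supply cost
# subject to meeting regional demand and respecting reservoir capacities.
cost = [1.0, 1.2, 0.9]            # cost/unit: Capilano, Seymour, Coquitlam
capacity = [120.0, 100.0, 150.0]  # max units each reservoir can treat
demand = 300.0                    # total regional demand

res = linprog(
    c=cost,
    A_ub=[[-1.0, -1.0, -1.0]],    # -(x1+x2+x3) <= -demand, i.e. sum >= demand
    b_ub=[-demand],
    bounds=[(0, cap) for cap in capacity],
)
allocation = res.x  # optimal draw from each reservoir
```

With these toy numbers the solver drains the cheapest reservoir (Coquitlam) to capacity first, exactly the behavior a sensitivity report would then probe.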
License: Nova Scotia Open Data Licence, http://novascotia.ca/opendata/licence.asp
The dataset includes various health service metrics and indicators related to Nova Scotia Health. The data is collected from multiple sources within the health system, including hospital inpatient, emergency, and surgical databases, Continuing Care home support and long-term care reports, and Emergency Health Services (EHS). The data is aggregated and anonymized to ensure privacy and does not contain any personally identifiable health information. This dataset is used to build Action for Health Public Reporting, and the goal of this project is to provide accessible healthcare information to the general public, researchers, and analysts in order to improve understanding and foster improvements in the healthcare system in Nova Scotia.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This research work was developed as part of my doctoral thesis on the artistic production techniques of bronze sculptures by Auguste Rodin – at DCR-FCT-NOVA, funded by the Fundação para a Ciência e Tecnologia.
The text, presented elsewhere, sought to address the gap in knowledge about chemical formulations for producing colored oxidized surfaces on copper alloy sculptures between 1840 and 1917. Here we carry out a literature review and propose a synthesis, in the form of tables, of the possible formulations of that period (a systematization of secondary data), in order to contribute to the understanding of the artistic production techniques behind these surface finishes.
This knowledge is essential for a critical analysis of the conservation of copper alloy sculptures from this period, and for making any kind of analogy based on material characterization data from works of art of the time. The information conveyed here sheds light on 19th-century methodologies for obtaining the 7 main colors, will allow discussion of terminology issues related to 19th-century chemical formulation, and provides secondary data that will enable the material characterization of the original surface of sculptures in "patinated bronze".
We present here the set of data systematization tables on which the text is based. The tables were created in Excel because of the extent of the data. The text is in English, French, and Portuguese.
Table 1: Systematization of transcribed wordings (original data)
Table 2: Systematization of standardized formulations
Table 3: Glossary plus bibliographical references / Références bibliographiques / Referências Bibliográficas
Table 4: Sources / Cases / Résumé
Table 5: Bronzeurs / Metteurs en couleurs
Table 6: Measures and equivalences
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Overview: This dataset combines publicly available data on obesity rates, poverty rates, and median household income for all 50 U.S. states from 2019 to 2023. It also includes calculated regional averages based on U.S. Census Bureau-defined regions (Northeast, Midwest, South, and West).
Use Cases - Public health research - Data visualization projects - Socioeconomic analysis - ML models exploring health + income
Sources - CDC BRFSS – Adult Obesity Prevalence Maps (2019–2023) - U.S. Census Bureau – SAIPE Datasets (2019–2023)
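The calculated regional averages can be reproduced with a groupby over the Census-defined regions. A minimal sketch with invented state rows and assumed column names:

```python
import pandas as pd

# Hypothetical state-level rows in the dataset's shape (rates made up)
states = pd.DataFrame({
    "State": ["Texas", "Florida", "Ohio", "Illinois"],
    "Region": ["South", "South", "Midwest", "Midwest"],
    "Year": [2023, 2023, 2023, 2023],
    "ObesityRate": [35.0, 31.0, 38.0, 34.0],
    "PovertyRate": [14.0, 13.0, 13.5, 11.5],
})

# Regional averages per year: mean of the member states' values
regional = states.groupby(["Region", "Year"])[["ObesityRate", "PovertyRate"]].mean()
```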
Tableau Dashboard
View the interactive Tableau dashboard:
https://public.tableau.com/app/profile/geo.montes/viz/ObesityPovertyandIncomeintheU_S_2019-2023/Dashboard1#2
Created by Geo Montes, Informatics major at UT Austin
This dataset contains the Council's annual budget data. The budget comprises Tables A to F and Appendices 1 and 2. Each table is represented by a separate data file.
Table D is the analysis of budgeted income from goods and services. It contains:
"Income" for goods and services by "Income Source" for the budget year
"Income" for goods and services by "Income Source" for the previous financial year
The data in this dataset is best interpreted alongside Table D of the published annual budget document, which can be found at www.fingal.ie
The data fields of Table D are as follows:
Doc: Table reference
Heading: Indicates the sections of the table; Table D consists of one section, so the heading value for all records = 1
Ref: Income source reference
Desc: Income source description
Inc: "Income" adopted by the Council for the budget year
PY: "Income" for the previous financial year
Sort: Sort code
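With the Inc and PY fields defined above, a year-on-year comparison per income source is straightforward. A sketch in pandas with invented figures (not actual Fingal budget values):

```python
import pandas as pd

# Hypothetical Table D records using the field names defined above
table_d = pd.DataFrame({
    "Ref": ["D1", "D2"],
    "Desc": ["Housing rents", "Parking charges"],
    "Inc": [5200.0, 3100.0],   # budget-year income
    "PY": [5000.0, 3250.0],    # previous-year income
})

# Year-on-year change per income source, in percent
table_d["ChangePct"] = (table_d["Inc"] - table_d["PY"]) / table_d["PY"] * 100
```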
License: CC0 1.0 Universal (Public Domain), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains a collection of questions and answers that have been contextualized to reveal subtle implications and insights. It is focused on helping researchers gain a deeper understanding of how semantics, context, and other factors affect how people interpret and respond to various conversations about different topics. By exploring this dataset, researchers will be able to uncover the underlying principles governing conversation styles, which can then be applied to better understand attitudes among different groups. With its comprehensive coverage of questions from a variety of sources around the web, this dataset offers an invaluable resource for those looking to analyze discourse in terms of sentiment analysis or opinion mining.
For more datasets, click here.
How to Use This Dataset
This dataset contains a collection of contextualized questions and answers extracted from various sources around the web, which can be useful for exploring implications and insights. To get started with the dataset:
- Read through the headings on each column in order to understand the data that has been collected - this will help you identify which pieces of information are relevant for your research project.
- Explore each column and view what types of responses have been given in response to particular questions or topics - this will give you an idea as to how people interpret specific topics differently when presented with different contexts or circumstances.
- Next, analyze the responses looking for any patterns or correlations between responses on different topics or contexts; this can help reveal implications and insights about a particular subject matter previously unknown to you. You can also use data visualization tools such as Tableau or Power BI to gain a deeper understanding of the results and trends within your dataset!
- Finally, use these findings to better inform your project by tailoring future questions around any patterns discovered within your analysis!
- To understand the nature of public debates and how people express their opinions in different contexts.
- To better comprehend the implicit attitudes and assumptions inherent in language use, providing insight into discourse norms on a range of issues.
- To gain insight into the use of rhetorical devices, such as exaggeration and deceptive tactics, used to influence public opinion on important topics
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. You can copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.
File: train.csv

| Column name | Description |
|:--------------|:-----------------------------------------------------------------------------|
| context | The context in which the question was asked and the answer was given. (Text) |
License: CC0 1.0 Universal (Public Domain), https://creativecommons.org/publicdomain/zero/1.0/
As a junior data analyst at a fanfiction analytics consultancy, I was tasked with analyzing how archive warnings are distributed across fanfiction works on Archive of Our Own (AO3). The client is interested in how archive warnings are used across the archive, and in what a better understanding of those warnings can enable.
The dataset includes ~600,000 AO3 fanfiction works, organized across three tables:

works: metadata on fanfiction works
tags: tag records, including tag types like archive warnings and fandoms
work_tag: many-to-many mapping of works and tags, joined on work_id and tag_id

Warning tags are selected with type == "ArchiveWarning" and fandom tags with type == "Fandom"; the key columns are typed integer and character.

| Warning Name | Total Works | % of All Works |
|---|---|---|
| No Archive Warnings Apply | 32,051 | 5.33% |
| Choose Not To Use Archive Warnings | 21,591 | 3.59% |
| Graphic Depictions Of Violence | 5,281 | 0.88% |
| Major Character Death | 3,009 | 0.50% |
| Rape/Non-Con | 1,650 | 0.27% |
# Load dplyr for the pipelines below
library(dplyr)

# Filter archive warning tags
archive_warnings <- tags %>%
filter(type == "ArchiveWarning") %>%
select(warning_id = id, warning_name = name)
# Filter tag mapping for works that use archive warnings
work_warnings <- work_tag %>%
filter(tag_id %in% archive_warnings$warning_id)
# Total number of works with at least one archive warning
total_works_with_warning <- work_warnings %>%
summarise(total = n_distinct(work_id)) %>%
pull(total)
# Count per warning and join with tag names
warning_summary <- work_warnings %>%
group_by(tag_id) %>%
summarise(total_works_with_warning = n_distinct(work_id)) %>%
mutate(percent_of_all_works = (total_works_with_warning / 601286) * 100) %>%
rename(warning_id =...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/ (license information derived automatically)
Goal: Analyze how wildfire activity has changed across California counties between 2020 and 2025, focusing on total acres burned and the frequency of wildfire events.
Data & Process: The dataset was cleaned using Google Sheets and visualized in Tableau to reveal patterns and trends in wildfire activity over time.
Key Insights:
Certain counties consistently experience higher wildfire activity.
Users can explore how both acres burned and fire frequency vary by county and year.
Tableau Dashboard: View Dashboard
This is a simple, structured dataset for analyzing California wildfires from 2020 to 2025. It includes county-level data on the number of wildfires and total acres burned, making it suitable for time-series analysis, geospatial visualization, and frequency trend exploration.
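The county-level trend questions (total acres burned, fire frequency) map onto simple groupby aggregations. A sketch with invented rows in the dataset's county-year shape:

```python
import pandas as pd

# Hypothetical rows shaped like the dataset (county, year, counts, acres)
fires = pd.DataFrame({
    "County": ["Butte", "Butte", "Shasta", "Shasta"],
    "Year": [2020, 2021, 2020, 2021],
    "Fires": [12, 9, 7, 11],
    "AcresBurned": [320000, 15000, 48000, 96000],
})

# Total fires and acres burned per county across the period
by_county = fires.groupby("County")[["Fires", "AcresBurned"]].sum()

# Year-over-year totals for trend analysis
by_year = fires.groupby("Year")["AcresBurned"].sum()
```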
Screenshots:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F28514651%2F0056e4dce61af96d268688d369a0e1d9%2FMapCA.png?generation=1760388920409039&alt=media
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F28514651%2Fb224540c58bc471aeb1da8ee5c6a4aab%2FCountyMapCA.png?generation=1760388971146314&alt=media
Source: The original wildfire data was collected from publicly available records provided by CAL FIRE and related California wildfire reporting resources.
The dataset has been cleaned and compiled for easier analysis and visualization.
Eminem is one of the most influential hip-hop artists of all time, the Rap God. I acquired this data using the Spotify API and supplemented it with other research to add to my own analysis. You can find my original analysis here: https://kaivalyapowale.com/2020/01/25/eminems-album-trends-and-music-to-be-murdered-by-2020/
My analysis was also published by top hip-hop websites: HipHop 24x7 - Data analysis reveals M2BMB is the most negative album Eminem Pro - Album's data analysis Eminem Pro - Eminem's albums are getting shorter
You can also check out visualizations on Tableau Public for some ideas: https://public.tableau.com/profile/kaivalya.powale#!/
I have primarily used data from Spotify’s API using multiple endpoints for albums and tracks. I supplemented the data with stats from Billboard and calculations from this post.
Here's the explanation for all the audio features provided by Spotify!
I have researched data about album sales from multiple sources online. They are cited in my original analysis.
Here are Spotify's Album endpoints. Charts data is from Billboard, and swear data is from this source.
I'd love to see new visualizations using this data or using the sales, swear, or duration for an analysis. It would be wonderful if someone compares this with other hip-hop greats.
License: CC0 1.0 Universal (Public Domain), https://creativecommons.org/publicdomain/zero/1.0/
This notebook was my final project for the Ironhack Data Analytics Bootcamp. This project also has a Tableau presentation.
The chronic disease indicators (CDI) are a set of surveillance indicators developed by consensus among CDC, the Council of State and Territorial Epidemiologists (CSTE), and the National Association of Chronic Disease Directors (NACDD). CDI enables public health professionals and policymakers to retrieve uniformly defined state-level data for chronic diseases and risk factors that have a substantial impact on public health. These indicators are essential for surveillance, prioritization, and evaluation of public health interventions. Several of the current chronic disease indicators are available and reported on other websites, either by the data source/custodians or by categorical chronic disease programs. However, CDI is the only integrated source for comprehensive access to a wide range of indicators for the surveillance of chronic diseases, conditions, and risk factors at the state level.
The original CDI consisted of 73 indicators adopted in 1998 and amended in 2002. In 2012-13, CDC, CSTE, and NACDD collaborated on a series of reviews that were informed by subject-matter expert opinion to make recommendations for updating CDI. The goal of this review was to ensure that CDI is responsive to the expanded scope and priorities of chronic disease prevention programs in state health departments.
As a result, CDI increased to 124 indicators in the following 18 topic groups: alcohol; arthritis; asthma; cancer; cardiovascular disease; chronic kidney disease; chronic obstructive pulmonary disease; diabetes; immunization; nutrition, physical activity, and weight status; oral health; tobacco; overarching conditions; and new topic areas that include disability, mental health, older adults, reproductive health, and school health. For the first time, CDI includes 22 indicators of systems and environmental change. A total of 201 individual measures are included for the 124 indicators, many of which overlap multiple chronic disease topic areas or are specific to a certain sex or age group.
CDI is an example of collaboration among CDC and state health departments in building a consensus set of state-based health surveillance indicators. This update will help ensure that CDI remains the most relevant and current collection of chronic disease surveillance data for state epidemiologists, chronic disease program officials, and reproductive health and maternal and child health officials. The standardized indicator definitions will also encourage consistency in chronic disease surveillance at the national, state, and local public health levels.
The data has been downloaded from https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi
Data Columns Reference https://www.cdc.gov/mmwr/pdf/rr/rr6401.pdf
I wouldn't be here without the help of others. This notebook was made with the information from the work of Daniel Wu, and Pedro Moreno.
What is the most common disease in the USA? What are the factors affecting the top disease? Does a disease respect borders? In other words, is a disease limited by state borders? How does a disease change over time?