Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create the dataset, the top 10 countries leading in the incidence of COVID-19 in the world were selected as of October 22, 2020 (on the eve of the second wave of the pandemic), which are presented in the Global 500 ranking for 2020: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, no more than 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. The arithmetic averages and the change (increase) in indicators such as the profit and profitability of enterprises, their ranking position (competitiveness), asset value, and number of employees were calculated. The arithmetic mean values of these indicators for all countries of the sample were found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data are collected in a single Microsoft Excel table. The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. The dataset is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the dataset contains formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data but also charts that provide data visualization. The dataset contains not only actual but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented in the form of a normal distribution of predicted values and the probability of their occurrence in practice. This allows for a broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship: substituting various predicted morbidity and mortality rates into the risk assessment tables yields automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and after the second wave of the pandemic to check the reliability of the earlier forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted indicators, but also their qualitative interpretation, reflecting the presence and level of risk of the pandemic and COVID-19 crisis for international entrepreneurship.
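The scenario-analysis workflow described above lives in the Excel formulas themselves. As a loose, hypothetical illustration of the same idea in Python (the normal-distribution parameters and the sensitivity coefficient below are made up and are not part of the dataset), one could sample forecast morbidity values and propagate them to an entrepreneurship indicator:

# Illustrative sketch only: all numbers below are placeholders, not dataset values.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical second-wave daily incidence forecast, modeled as a normal distribution
incidence_scenarios = rng.normal(loc=60_000, scale=10_000, size=1_000)

# Hypothetical linear sensitivity: change in average corporate profitability
# (percentage points) per additional 10,000 daily cases
sensitivity = -0.4
profitability_change = sensitivity * incidence_scenarios / 10_000

print(f"Median scenario: {np.median(profitability_change):.2f} pp change")
print(f"5th to 95th percentile: {np.percentile(profitability_change, 5):.2f} "
      f"to {np.percentile(profitability_change, 95):.2f} pp")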
Summarize big data with pivot tables, charts, and slicers.
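The entry above describes pivot tables, charts, and slicers (presumably in Excel). A rough pandas analogue, with a hypothetical file name and column names, looks like this:

# Minimal pandas analogue of an Excel pivot table with a slicer-style filter.
import pandas as pd

df = pd.read_excel("sales.xlsx")          # big flat table (hypothetical file)
df = df[df["Region"] == "North"]          # slicer: restrict to one region

pivot = pd.pivot_table(
    df,
    index="Category",        # pivot rows
    columns="Year",          # pivot columns
    values="Revenue",        # values to aggregate
    aggfunc="sum",
)
pivot.plot(kind="bar")       # quick chart of the summarized data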
Dataset Link: pakistan’s_largest_ecommerce_dataset. Cleaned Data: Cleaned_Pakistan’s_largest_ecommerce_dataset
Rows: 584525, Columns: 21
All the raw data was transformed and saved in a new Excel file: Working – Pakistan Largest Ecommerce Dataset
Rows: 582250, Columns: 22. Visualization: here is the link to the visualization report: Pakistan-s-largest-ecommerce-data-Power-BI-Data-Visualization-Report
Among categories, Mobiles & Tablets makes the most money, selling the highest number of products while also offering the largest discounts. Men’s Fashion sells the second-highest number of products but does not generate revenue at the same rate; the prices of its individual products are a likely reason. In the order details, Mobiles & Tablets has the highest number of canceled orders, although its completed orders are roughly the same as Men’s Fashion. Most orders are completed, but there is a large number of canceled orders. Among payment methods, cod (cash on delivery) has the most completed orders, while canceled orders mostly use the Easyaxis payment method.
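A hedged sketch of how these category-level findings could be reproduced with pandas; the column names below (category_name_1, grand_total, discount_amount, status) are assumptions about the cleaned file's headers and may differ:

import pandas as pd

orders = pd.read_csv("Pakistan Largest Ecommerce Dataset.csv", low_memory=False)

# Revenue, discount, and order counts per category
by_category = orders.groupby("category_name_1").agg(
    revenue=("grand_total", "sum"),
    discount=("discount_amount", "sum"),
    orders=("status", "size"),
)
print(by_category.sort_values("revenue", ascending=False).head())

# Completed vs. canceled orders per category
status_counts = pd.crosstab(orders["category_name_1"], orders["status"])
print(status_counts.head())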
This dataset was created by Pinky Verma
Microsoft Excel based (using Visual Basic for Applications) data-reduction and visualization tools have been developed that allow users to numerically reduce large sets of geothermal data to any size. The data can be quickly sifted through and graphed to support their study. The ability to analyze large data sets can reveal responses to field management procedures that would otherwise be undetectable. Field-wide trends such as decline rates, response to injection, evolution of superheat, recording instrumentation problems, and data inconsistencies can be quickly queried and graphed. The application of these newly developed tools to data from The Geysers geothermal field is illustrated. A copy of these tools may be requested by contacting the authors.
Python is a free programming language that prioritizes human readability and general-purpose use. It is one of the easier languages to learn and get started with, especially with no prior programming knowledge. I have been using Python for Excel spreadsheet automation, data analysis, and data visualization. It has allowed me to better focus on learning how to automate my data analysis workload. I am currently examining the North Carolina Department of Environmental Quality (NCDEQ) database of water quality sampling for the Town of Nags Head, NC. It spans over 26 years (1997-2023) and currently lists a total of 41 different testing site locations. The statewide NCDEQ data contains 148,204 testing data points; 34,759 of these are from Dare County (Nags Head) specifically, subdivided into testing sites.
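A minimal sketch of the kind of filtering described above, assuming a hypothetical export file and column names (County, Station) for the NCDEQ data:

import pandas as pd

results = pd.read_excel("ncdeq_water_quality.xlsx")   # ~148,204 rows statewide

dare = results[results["County"] == "Dare"]           # ~34,759 rows expected
print(f"{len(dare)} Dare County records across "
      f"{dare['Station'].nunique()} testing sites")

# Per-site sample counts, useful before plotting trends by location
print(dare.groupby("Station").size().sort_values(ascending=False).head(10))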
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the raw experimental data and supplementary materials for the "Asymmetry Effects in Virtual Reality Rod and Frame Test". The materials included are:
• Raw Experimental Data: older.csv and young.csv
• Mathematica Notebooks: a collection of Mathematica notebooks used for data analysis and visualization. These notebooks provide scripts for processing the experimental data, performing statistical analyses, and generating the figures used in the project.
• Unity Package: a Unity package featuring a sample scene related to the project. The scene was built using Unity’s Universal Rendering Pipeline (URP). To utilize this package, ensure that URP is enabled in your Unity project. Instructions for enabling URP can be found in the Unity URP Documentation.
Requirements:
• For Data Files: software capable of opening CSV files (e.g., Microsoft Excel, Google Sheets, or any programming language that can read CSV formats).
• For Mathematica Notebooks: Wolfram Mathematica software to run and modify the notebooks.
• For Unity Package: Unity Editor version compatible with URP (2019.3 or later recommended). URP must be installed and enabled in your Unity project.
Usage Notes:
• The dataset facilitates comparative studies between different age groups based on the collected variables.
• Users can modify the Mathematica notebooks to perform additional analyses.
• The Unity scene serves as a reference to the project setup and can be expanded or integrated into larger projects.
Citation: Please cite this dataset when using it in your research or publications.
[NOTE - 2022-09-07: this dataset is superseded by an updated version https://doi.org/10.15482/USDA.ADC/1526332 ] This dataset contains soil water content data developed from neutron probe readings taken in access tubes in two of the four large, precision weighing lysimeters and in the fields surrounding each lysimeter that were planted to winter wheat at the USDA-ARS Conservation and Production Research Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU), Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL) beginning in 1989. Data in each spreadsheet are for one winter wheat growing season: 1989-1990, 1991-1992, or 1992-1993. Other readings taken in those years for other crops are reported elsewhere. Data for the 1989-1990 and 1992-1993 seasons are from the northwest (NW) and southwest (SW) weighing lysimeters and surrounding fields. Data for the 1991-1992 season are from the northeast (NE) and southeast (SE) weighing lysimeters and surrounding fields. Readings were taken periodically with a field-calibrated neutron probe at depths from 10 cm to 230 cm (maximum of 190 cm in the lysimeters) in 20-cm depth increments. Periods between readings were typically one to two weeks, sometimes longer according to experimental design and the need for data. Field calibrations in the Pullman soil series were done every few years. Calibrations typically produced a regression equation with RMSE <= 0.01 m3 m-3 (e.g., Evett and Steiner, 1995). Data were used to guide irrigation scheduling to achieve full or deficit irrigation as required by the experimental design. Data may be used to calculate the soil profile water content in mm of water from the surface to the maximum depth of reading. Profile water content differences between reading times in the same access tube are considered the change in soil water storage during the period in question and may be used to compute evapotranspiration (ET) using the soil water balance equation ET = dS + P + I - F - R, where dS is the decrease in profile soil water storage during the period (earlier reading minus later reading), P is precipitation during the period, I is irrigation during the period, F is soil water flux (drainage) out of the bottom of the soil profile during the period, and R is the sum of runon and runoff during the period. Typically, R is taken as zero because the fields were furrow diked to prevent runon and runoff during most of each growing season.
Resources in this dataset:
Resource Title: 1989-90 Bushland, TX, west winter wheat volumetric soil water content data. File Name: 1989-90_West_Winter-Wheat_Soil-water.xlsx. Resource Description: Contains periodic volumetric soil water content data from neutron probe readings in 20-cm depth increments from 10-cm depth to 230-cm depth in access tubes in fields around the Bushland, TX, northwest (NW) and southwest (SW) large, precision weighing lysimeters, and to 190-cm depth in each lysimeter. The Excel file contains a data dictionary for each tab containing data. There is also a tab named Introduction that lists the authors, equipment used, relevant citations, and explains the other tabs, which contain either data dictionaries, data, geographical coordinates of access tube locations, or data visualization tools. Tab names are unique so that tabs may be saved as individual CSV files.
Resource Title: 1991-92 Bushland, TX, east winter wheat volumetric soil water content data. File Name: 1991-92_East_Winter-Wheat_Soil-water.xlsx. Resource Description: Contains periodic volumetric soil water content data from neutron probe readings in 20-cm depth increments from 10-cm depth to 230-cm depth in access tubes in fields around the Bushland, TX, northeast (NE) and southeast (SE) large, precision weighing lysimeters, and to 190-cm depth in each lysimeter. The Excel file contains a data dictionary for each tab containing data. There is also a tab named Introduction that lists the authors, equipment used, relevant citations, and explains the other tabs, which contain either data dictionaries, data, geographical coordinates of access tube locations, or data visualization tools. Tab names are unique so that tabs may be saved as individual CSV files.
Resource Title: 1992-93 Bushland, TX, west winter wheat volumetric soil water content data. File Name: 1992-93_West_Winter-Wheat_Soil-water.xlsx. Resource Description: Contains periodic volumetric soil water content data from neutron probe readings in 20-cm depth increments from 10-cm depth to 230-cm depth in access tubes in fields around the Bushland, TX, northwest (NW) and southwest (SW) large, precision weighing lysimeters, and to 190-cm depth in each lysimeter. The Excel file contains a data dictionary for each tab containing data. There is also a tab named Introduction that lists the authors, equipment used, relevant citations, and explains the other tabs, which contain either data dictionaries, data, geographical coordinates of access tube locations, or data visualization tools. Tab names are unique so that tabs may be saved as individual CSV files.
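A short illustrative sketch of the storage and water-balance calculation described above; the depth increment, example readings, and period totals are placeholders, and the spreadsheets' actual layout differs:

import numpy as np

depth_increment_mm = 200.0  # readings every 20 cm

def profile_storage_mm(theta):
    # Profile water storage (mm) from volumetric water contents (m3/m3),
    # assuming each reading represents a 20-cm layer.
    return float(np.sum(np.asarray(theta) * depth_increment_mm))

theta_start = [0.28, 0.30, 0.31, 0.32, 0.33]  # example readings, shallow to deep
theta_end   = [0.24, 0.27, 0.30, 0.32, 0.33]

dS = profile_storage_mm(theta_start) - profile_storage_mm(theta_end)  # depletion, mm

P, I, F, R = 12.0, 25.0, 1.0, 0.0  # precipitation, irrigation, drainage, runon/runoff (mm)
ET = dS + P + I - F - R            # soil water balance as described above
print(f"Period ET = {ET:.1f} mm")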
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
- Country: Name of the country.
- Density (P/Km2): Population density measured in persons per square kilometer.
- Abbreviation: Abbreviation or code representing the country.
- Agricultural Land (%): Percentage of land area used for agricultural purposes.
- Land Area (Km2): Total land area of the country in square kilometers.
- Armed Forces Size: Size of the armed forces in the country.
- Birth Rate: Number of births per 1,000 population per year.
- Calling Code: International calling code for the country.
- Capital/Major City: Name of the capital or major city.
- CO2 Emissions: Carbon dioxide emissions in tons.
- CPI: Consumer Price Index, a measure of inflation and purchasing power.
- CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
- Currency_Code: Currency code used in the country.
- Fertility Rate: Average number of children born to a woman during her lifetime.
- Forested Area (%): Percentage of land area covered by forests.
- Gasoline_Price: Price of gasoline per liter in local currency.
- GDP: Gross Domestic Product, the total value of goods and services produced in the country.
- Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
- Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
- Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
- Largest City: Name of the country's largest city.
- Life Expectancy: Average number of years a newborn is expected to live.
- Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
- Minimum Wage: Minimum wage level in local currency.
- Official Language: Official language(s) spoken in the country.
- Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
- Physicians per Thousand: Number of physicians per thousand people.
- Population: Total population of the country.
- Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
- Tax Revenue (%): Tax revenue as a percentage of GDP.
- Total Tax Rate: Overall tax burden as a percentage of commercial profits.
- Unemployment Rate: Percentage of the labor force that is unemployed.
- Urban Population: Percentage of the population living in urban areas.
- Latitude: Latitude coordinate of the country's location.
- Longitude: Longitude coordinate of the country's location.
- Analyze population density and land area to study spatial distribution patterns.
- Investigate the relationship between agricultural land and food security.
- Examine carbon dioxide emissions and their impact on climate change.
- Explore correlations between economic indicators such as GDP and various socio-economic factors (a minimal pandas sketch follows this list).
- Investigate educational enrollment rates and their implications for human capital development.
- Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
- Study labor market dynamics through indicators such as labor force participation and unemployment rates.
- Investigate the role of taxation and its impact on economic development.
- Explore urbanization trends and their social and environmental consequences.
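As a minimal sketch of the correlation idea flagged in the list above; the file name and cleaning steps are assumptions, since CSVs of this kind often store GDP and population as text with symbols and separators:

import pandas as pd

world = pd.read_csv("world-data-2023.csv")   # hypothetical file name

# Strip "$", ",", and other non-numeric characters before converting
for col in ["GDP", "Population", "Life expectancy"]:
    world[col] = pd.to_numeric(
        world[col].astype(str).str.replace(r"[^0-9.]", "", regex=True),
        errors="coerce",
    )

world["GDP per capita"] = world["GDP"] / world["Population"]
print(world[["GDP per capita", "Life expectancy"]].corr())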
Data Source: This dataset was compiled from multiple data sources
If this was helpful, a vote is appreciated ❤️ Thank you 🙂
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global GPU Database market size was USD 455 million in 2024 and will expand at a compound annual growth rate (CAGR) of 20.7% from 2024 to 2031.
Market Dynamics of GPU Database Market
Key Drivers for GPU Database Market: Growing demand for high-performance computing in various data-intensive industries. One of the main reasons the GPU Database market is growing is the demand for high-performance computing (HPC) across various data-intensive industries. These industries, including finance, healthcare, and telecommunications, require rapid data processing and real-time analytics, which GPU databases excel at providing. Unlike traditional CPU databases, GPU databases leverage the parallel processing power of GPUs to handle complex queries and large datasets more efficiently. This capability is crucial for applications such as machine learning, artificial intelligence, and big data analytics. The expansion of data and the increasing need for speed and scalability in processing are pushing enterprises to adopt GPU databases. Consequently, the market is poised for robust growth as organizations continue to seek solutions that offer enhanced performance, reduced latency, and greater computational power to meet their evolving data management needs. The increasing demand for insights from large volumes of data generated across verticals is expected to drive the GPU Database market's expansion in the years ahead.
Key Restraints for GPU Database Market: A lack of trained professionals poses a serious threat to the GPU Database industry. The market also faces significant difficulties related to insufficient security options.
Introduction of the GPU Database Market: The GPU database market is experiencing rapid growth due to the increasing demand for high-performance data processing and analytics. GPUs, or Graphics Processing Units, excel at parallel processing, making them ideal for handling large-scale, complex data sets with unprecedented speed and efficiency. This market is driven by the proliferation of big data, advancements in AI and machine learning, and the need for real-time analytics across industries such as finance, healthcare, and retail. Companies are increasingly adopting GPU-accelerated databases to enhance data visualization, predictive analytics, and computational workloads. Key players in this market include established tech giants and specialized startups, all contributing to a competitive landscape marked by innovation and strategic partnerships. As organizations continue to seek faster and more efficient ways to harness their data, the GPU database market is poised for substantial growth, reshaping the future of data management and analytics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public reporting of measures of hospital performance is an important component of quality improvement efforts in many countries. However, it can be challenging to provide an overall characterization of hospital performance because there are many measures of quality. In the United States, the Centers for Medicare and Medicaid Services reports over 100 measures that describe various domains of hospital quality, such as outcomes, the patient experience and whether established processes of care are followed. Although individual quality measures provide important insight, it is challenging to understand hospital performance as characterized by multiple quality measures. Accordingly, we developed a novel approach for characterizing hospital performance that highlights the similarities and differences between hospitals and identifies common patterns of hospital performance. Specifically, we built a semi-supervised machine learning algorithm and applied it to the publicly-available quality measures for 1,614 U.S. hospitals to graphically and quantitatively characterize hospital performance. In the resulting visualization, the varying density of hospitals demonstrates that there are key clusters of hospitals that share specific performance profiles, while there are other performance profiles that are rare. Several popular hospital rating systems aggregate some of the quality measures included in our study to produce a composite score; however, hospitals that were top-ranked by such systems were scattered across our visualization, indicating that these top-ranked hospitals actually excel in many different ways. Our application of a novel graph analytics method to data describing U.S. hospitals revealed nuanced differences in performance that are obscured in existing hospital rating systems.
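The study's exact semi-supervised graph algorithm is not reproduced here. As a loosely analogous illustration only, using random placeholder data rather than the published quality measures, one could embed hospitals in two dimensions from a nearest-neighbour graph over standardized measures:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
quality = rng.normal(size=(1614, 100))      # 1,614 hospitals x ~100 measures (placeholder)

X = StandardScaler().fit_transform(quality)
embedding = SpectralEmbedding(
    n_components=2, affinity="nearest_neighbors", n_neighbors=15
).fit_transform(X)

print(embedding.shape)   # (1614, 2) coordinates for plotting hospital density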
https://borealisdata.ca/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.5683/SP2/E7Z09B
Assembled from 196 references, this database records a total of 3,861 cases of historical dam failures around the world and represents the largest compilation of dam failures recorded to date (17-02-2020). Failures are recorded regardless of the type of dam (e.g., man-made dam, tailings dam, temporary dam, natural dam), the type of structure (e.g., concrete dam, embankment dam), the type of failure (e.g., piping failure, overtopping failure), or the properties of the dam (e.g., dam height, reservoir capacity). A total of 45 variables (i.e., the fields that compose the dataset) were used, when possible/available and relevant, to record information about each failure (e.g., dam descriptions, dam properties, breach dimensions). Coupled with Excel's functionalities (e.g., in Excel 2016: customizable screen visualization, individual search of specific cases, data filters, pivot tables), the database file can easily be adapted to the needs of the user (i.e., research field, dam type, dam failure type) and opens doors to various fields of research (e.g., hydrology, hydraulics, and dam safety). The dataset also allows any user to optimize the verification process, to identify duplicates, and to put the recorded historical dam failures back in context. Overall, this investigation has aimed to standardize the collection of historical dam failure data and to facilitate international collection by setting guidelines. The sharing method (i.e., made freely available through this link) not only represents a considerable asset for a wide audience (e.g., researchers, dam owners) but also paves the way for the field of dam safety in the current era of "Big Data". Updated versions will be deposited (at this DOI) at undetermined frequencies in order to update the data recorded over the years.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
As important carriers of innovation activities, patents, sci-tech achievements and papers play an increasingly prominent role in national political and economic development against the background of a new round of technological revolution and industrial transformation. However, in a distributed and heterogeneous environment, the integration and systematic description of patent, sci-tech achievement and paper data are still insufficient, which limits the in-depth analysis and utilization of related data resources. A knowledge graph dataset for patents, sci-tech achievements and papers is an important means of promoting innovation network research, and is of great significance for strengthening the development, utilization, and knowledge mining of innovation data. This work collected data on patents, sci-tech achievements and papers from China's authoritative websites spanning the three major industries (agriculture, industry, and services) during the period 2022-2025. After cleaning, organizing, and normalization, a patents-sci-tech achievements-papers knowledge graph dataset was formed, containing 10 entity types and 8 types of entity relationships. To ensure the quality and accuracy of the data, the entire process involved strict preprocessing, semantic extraction and verification, with an ontology model introduced as the schema layer of the knowledge graph. The dataset establishes direct correlations among patents, sci-tech achievements and papers through inventors/contributors/authors, and uses the Neo4j graph database for storage and visualization. The open dataset constructed in this study can serve as important foundational data for building knowledge graphs in the field of innovation, providing structured data support for innovation activity analysis, scientific research collaboration network analysis, and knowledge discovery.
The dataset consists of two parts. The first part includes three Excel tables: 1,794 patent records with 10 fields, 181 paper records with 7 fields, and 1,156 scientific and technological achievement records with 11 fields. The second part is a knowledge graph dataset in CSV format that can be imported into Neo4j, comprising 10 entity files and 8 relationship files.
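A hedged sketch of importing one of the CSV entity files into Neo4j with the official Python driver; the connection details, file name, and field names below are assumptions, not the dataset's documented headers:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# LOAD CSV expects the file in Neo4j's import directory; field names are hypothetical
load_patents = """
LOAD CSV WITH HEADERS FROM 'file:///patents.csv' AS row
MERGE (p:Patent {patent_id: row.patent_id})
SET p.title = row.title, p.year = row.year
"""

with driver.session() as session:
    session.run(load_patents)
driver.close()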
https://creativecommons.org/publicdomain/zero/1.0/
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns
Extra steps and more explanation of the process will be given through the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The director column nulls are about 30% of the whole column, so I will not delete them. Instead, I will find another column to populate them from. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records.
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to the director column, we are going to populate country using director.
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--Confirm whether any rows with a known director still have a NULL country (i.e., did not update)
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULLs in country as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
The date_added column has just 10 nulls out of over 8,000 rows, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE F...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains 4 files. The Excel file is the original file after scraping the data from the website, but it is very raw and uncleaned. After spending a lot of time, I cleaned the data into what I thought best represents the dataset and can be used for projects. Explore all the datasets and share your notebooks and insights! Consider upvoting if you find it helpful. Thank you.
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions on itemsets that a customer is most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that have happened over a period of time. The retailer will use the results to grow in the industry and provide customers with itemset suggestions, so that we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem with Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.
Association rules are most useful when you are planning to find associations between different objects in a set, such as frequent patterns in a transaction database. They can tell you which items customers frequently buy together and allow the retailer to identify relationships between the items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 = 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
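A quick check of that arithmetic in Python, using only the toy numbers above:

# 100 customers, 10 bought a mouse, 9 bought a mouse mat, 8 bought both
n, n_mouse, n_mat, n_both = 100, 10, 9, 8

support = n_both / n                  # P(mouse and mat)    = 0.08
confidence = n_both / n_mouse         # P(mat | mouse)      = 0.80
lift = confidence / (n_mat / n)       # confidence / P(mat) ~ 8.9

print(support, confidence, round(lift, 2))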
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
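The original walkthrough continues in R (see the screenshots linked above). As a hedged alternative sketch of the same transaction conversion and rule mining in Python with mlxtend; the BillNo and Itemname column names are assumptions about Assignment-1_Data.xlsx:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_excel("Assignment-1_Data.xlsx").dropna(subset=["Itemname"])

# Convert invoice-level rows into one-hot "transaction" format
basket = (
    df.groupby(["BillNo", "Itemname"]).size().unstack(fill_value=0).astype(bool)
)

frequent = apriori(basket, min_support=0.02, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())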
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
In this project, I conducted a comprehensive analysis of retail and warehouse sales data to derive actionable insights. The primary objective was to understand sales trends, evaluate performance across channels, and identify key contributors to overall business success.
To achieve this, I transformed raw data into interactive Excel dashboards that highlight sales performance and channel contributions, providing a clear and concise representation of business metrics.
Key Highlights of the Project:
- Created two dashboards: Sales Dashboard and Contribution Dashboard.
- Answered critical business questions, such as monthly trends, channel performance, and top contributors.
- Presented actionable insights with professional visuals, making it easy for stakeholders to make data-driven decisions.
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
https://creativecommons.org/publicdomain/zero/1.0/
The Iris Dataset consists of 150 iris samples, each having four numerical features: sepal length, sepal width, petal length, and petal width. Each sample is categorized into one of three iris species: Setosa, Versicolor, or Virginica. This dataset is widely used as a sample dataset in machine learning and statistics due to its simple and easily understandable structure.
Feature Information: Sepal Length (cm), Sepal Width (cm), Petal Length (cm), Petal Width (cm)
Target Information: Iris Species: 1. Setosa, 2. Versicolor, 3. Virginica
Source : The Iris Dataset is obtained from the scikit-learn (sklearn) library under the BSD (Berkeley Software Distribution) license.
File Formats :
The Iris Dataset is one of the most iconic datasets in the world of machine learning and data science. This dataset contains information about three species of iris flowers: Setosa, Versicolor, and Virginica. With features like sepal and petal length and width, the Iris dataset has been a stepping stone for many beginners in understanding the fundamental concepts of classification and data analysis. With its clarity and diversity of features, the Iris dataset is perfect for exploring various machine learning techniques and building accurate classification models. I present the Iris dataset from scikit-learn with the hope of providing an enjoyable and inspiring learning experience for the Kaggle community!
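Since the source information above notes that the data comes from scikit-learn, a minimal loading sketch:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame                      # 150 rows: 4 features + numeric target
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
print(df.head())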
The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.
This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.
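A hedged first-pass summary in pandas; the file name and column names below ("Category", "Purchase Amount (USD)", "Payment Method") are assumptions and may not match the actual headers:

import pandas as pd

shopping = pd.read_csv("shopping_trends.csv")   # 3,900 records

print(shopping.groupby("Category")["Purchase Amount (USD)"].agg(["count", "mean"]))
print(shopping["Payment Method"].value_counts())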
https://i.imgur.com/6UEqejq.png
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.
Cover Photo by: Freepik
Thumbnail by: Clothing icons created by Flat Icons - Flaticon