100+ datasets found
  1. TMDB movies clean dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bharat Kumar0925 (2024). TMDB movies clean dataset [Dataset]. https://www.kaggle.com/datasets/bharatkumar0925/tmdb-movies-clean-dataset
    Explore at:
    zip(266877093 bytes)Available download formats
    Dataset updated
    Sep 6, 2024
    Authors
    Bharat Kumar0925
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset contains two files: Large_movies_data.csv and large_movies_clean.csv. The data is taken from the TMDB dataset. Originally, it contained around 900,000 movies, but some movies were dropped for recommendation purposes. Specifically, movies missing an overview were removed since the overview is one of the most important columns for analysis.

    Column Description:

    Large_movies_data.csv:

    • Id: Unique identifier for each movie.
    • Title: The title of the movie.
    • Overview: A brief description of the movie.
    • Genres: The genres associated with the movie.
    • Cast: The main actors in the movie.
    • Director: The director of the movie.
    • Writers: The screenwriters of the movie.
    • Production_companies: Companies involved in producing the movie.
    • Producers: Producers of the movie.
    • Original_language: The original language of the movie.
    • Vote_count: Number of votes the movie has received.
    • Vote_average: Average rating based on user votes.
    • Popularity: Popularity score of the movie.
    • Runtime: Duration of the movie in minutes.
    • Release_date: The release date of the movie.

    Total movies in Large_movies_data.csv: 663,828.

    Large_movies_clean.csv:

    This file is a cleaned version with unnecessary columns removed, text converted to lowercase, and many symbols removed (though some may still remain). If you find that certain features are missing, you can use the original Large_movies_data.csv.

    Columns in large_movies_clean.csv: - Id: Unique identifier for each movie. - Title: The title of the movie. - Tags: Combined information from the overview, genres, and other textual columns. - Original_language: The original language of the movie. - Vote_count: Number of votes the movie has received. - Vote_average: Average rating based on user votes. - Year: Year extracted from the release date. - Month: Month extracted from the release date.

    Possible Use Cases:

    1. Recommendation System: A robust recommendation system can be built using this large dataset.
    2. Analysis: Analyze various aspects, such as identifying actors who starred in the most popular movies, the impact of having the same writer, director, and producer on a movie, and whether independent producers create better movies.
    3. Rating Prediction: Predict the average rating of a movie based on factors such as overview, genres, and cast.
    4. Other Analysis: Perform other types of analysis to discover patterns in the movie industry.

    If you find this dataset useful, please upvote it!

  2. LinkedIn Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 17, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Dec 17, 2021
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions. Dataset Features

    Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month. Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records. Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

    Customizable Subsets for Specific Needs Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications. Popular Use Cases

    Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data. Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities. Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies. Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis. AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

    Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.

  3. Large Customer Churn Analysis Dataset

    • kaggle.com
    zip
    Updated Dec 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hajra Amir (2024). Large Customer Churn Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/hajraamir21/large-customer-churn-analysis-dataset
    Explore at:
    zip(17387 bytes)Available download formats
    Dataset updated
    Dec 18, 2024
    Authors
    Hajra Amir
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains synthetic data generated for customer churn analysis. It includes 1000 entries representing customer information, such as demographics, account details, subscription types, and churn status. The data is ideal for predictive modeling, machine learning algorithms, and exploratory data analysis (EDA). Features: CustomerID: A unique identifier for each customer. Gender: Male or Female. Age: Customer's age in years. Geography: Country or region of the customer (e.g., Germany, France, UK). Tenure: Number of months the customer has been with the company. Contract: Type of subscription (Month-to-month, One-year, Two-year). MonthlyCharges: The amount billed monthly. TotalCharges: The total amount billed to date. PaymentMethod: Method used for payments (e.g., Credit card, Direct debit). IsActiveMember: Whether the customer is an active member (1 = Active, 0 = Inactive). Churn: Indicates whether the customer has churned (Yes/No).

  4. m

    Student Skill Gap Analysis

    • data.mendeley.com
    • kaggle.com
    Updated Apr 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bindu Garg (2025). Student Skill Gap Analysis [Dataset]. http://doi.org/10.17632/rv6scbpd7v.1
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Bindu Garg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is designed for skill gap analysis, focusing on evaluating the skill gap between students’ current skills and industry requirements. It provides insights into technical skills, soft skills, career interests, and challenges, helping in skill gap analysis to identify areas for improvement.

    By leveraging this dataset, educators, recruiters, and researchers can conduct skill gap analysis to assess students’ job readiness and tailor training programs accordingly. It serves as a valuable resource for identifying skill deficiencies and skill gaps improving career guidance, and enhancing curriculum design through targeted skill gap analysis.

    Following is the column descriptors: Name - Student's full name. email_id - Student's email address. Year - The academic year the student is currently in (e.g., 1st Year, 2nd Year, etc.). Current Course - The course the student is currently pursuing (e.g., B.Tech CSE, MBA, etc.). Technical Skills - List of technical skills possessed by the student (e.g., Python, Data Analysis, Cloud Computing). Programming Languages - Programming languages known by the student (e.g., Python, Java, C++). Rating - Self-assessed rating of technical skills on a scale of 1 to 5. Soft Skills - List of soft skills (e.g., Communication, Leadership, Teamwork). Rating - Self-assessed rating of soft skills on a scale of 1 to 5. Projects - Indicates whether the student has worked on any projects (Yes/No). Career Interest - The student's preferred career path (e.g., Data Scientist, Software Engineer). Challenges - Challenges faced while applying for jobs/internships (e.g., Lack of experience, Resume building issues).

  5. f

    Data from: Additive Hazards Regression Analysis of Massive Interval-Censored...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated May 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peiyao Huang; Shuwei Li; Xinyuan Song (2025). Additive Hazards Regression Analysis of Massive Interval-Censored Data via Data Splitting [Dataset]. http://doi.org/10.6084/m9.figshare.27103243.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 12, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Peiyao Huang; Shuwei Li; Xinyuan Song
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of data acquisition and storage space, massive datasets exhibited with large sample size emerge increasingly and make more advanced statistical tools urgently need. To accommodate such big volume in the analysis, a variety of methods have been proposed in the circumstances of complete or right censored survival data. However, existing development of big data methodology has not attended to interval-censored outcomes, which are ubiquitous in cross-sectional or periodical follow-up studies. In this work, we propose an easily implemented divide-and-combine approach for analyzing massive interval-censored survival data under the additive hazards model. We establish the asymptotic properties of the proposed estimator, including the consistency and asymptotic normality. In addition, the divide-and-combine estimator is shown to be asymptotically equivalent to the full-data-based estimator obtained from analyzing all data together. Simulation studies suggest that, relative to the full-data-based approach, the proposed divide-and-combine approach has desirable advantage in terms of computation time, making it more applicable to large-scale data analysis. An application to a set of interval-censored data also demonstrates the practical utility of the proposed method.

  6. c

    Fox News dataset is for analyzing media trends and narratives

    • crawlfeeds.com
    csv, zip
    Updated May 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

    Key Features of the Fox News Dataset

    • Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.
    • Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.
    • Format: Provided in CSV format for seamless integration into analytical and research tools.

    Why Use This Dataset?

    This large dataset is ideal for:

    • Text Classification: Develop machine learning models to classify and categorize news content.
    • Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling.
    • Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.
    • Trend Analysis: Identify shifts in public discourse and media focus over time.

    Explore More News Datasets

    Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

    The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.

  7. Refined DataCo Supply Chain Geospatial Dataset

    • kaggle.com
    zip
    Updated Jan 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Om Gupta (2025). Refined DataCo Supply Chain Geospatial Dataset [Dataset]. https://www.kaggle.com/datasets/aaumgupta/refined-dataco-supply-chain-geospatial-dataset
    Explore at:
    zip(29010639 bytes)Available download formats
    Dataset updated
    Jan 29, 2025
    Authors
    Om Gupta
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Refined DataCo Smart Supply Chain Geospatial Dataset

    Optimized for Geospatial and Big Data Analysis

    This dataset is a refined and enhanced version of the original DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS dataset, specifically designed for advanced geospatial and big data analysis. It incorporates geocoded information, language translations, and cleaned data to enable applications in logistics optimization, supply chain visualization, and performance analytics.

    Key Features

    1. Geocoded Source and Destination Data

    • Accurate latitude and longitude coordinates for both source and destination locations.
    • Facilitates geospatial mapping, route analysis, and distance calculations.

    2. Supplementary GeoJSON Files

    • src_points.geojson: Source point geometries.
    • dest_points.geojson: Destination point geometries.
    • routes.geojson: Line geometries representing source-destination routes.
    • These files are compatible with GIS software and geospatial libraries such as GeoPandas, Folium, and QGIS.

    3. Language Translation

    • Key location fields (countries, states, and cities) are translated into English for consistency and global accessibility.

    4. Cleaned and Consolidated Data

    • Addressed missing values, removed duplicates, and corrected erroneous entries.
    • Ready-to-use dataset for analysis without additional preprocessing.

    5. Routes and Points Geometry

    • Enables the creation of spatial visualizations, hotspot identification, and route efficiency analyses.

    Applications

    1. Logistics Optimization

    • Analyze transportation routes and delivery performance to improve efficiency and reduce costs.

    2. Supply Chain Visualization

    • Create detailed maps to visualize the global flow of goods.

    3. Geospatial Modeling

    • Perform proximity analysis, clustering, and geospatial regression to uncover patterns in supply chain operations.

    4. Business Intelligence

    • Use the dataset for KPI tracking, decision-making, and operational insights.

    Dataset Content

    Files Included

    1. DataCoSupplyChainDatasetRefined.csv

      • The main dataset containing cleaned fields, geospatial coordinates, and English translations.
    2. src_points.geojson

      • GeoJSON file containing the source points for easy visualization and analysis.
    3. dest_points.geojson

      • GeoJSON file containing the destination points.
    4. routes.geojson

      • GeoJSON file with LineStrings representing routes between source and destination points.

    Attribution

    This dataset is based on the original dataset published by Fabian Constante, Fernando Silva, and António Pereira:
    Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS”, Mendeley Data, V5, doi: 10.17632/8gx2fvg2k6.5.

    Refinements include geospatial processing, translation, and additional cleaning by the uploader to enhance usability and analytical potential.

    Tips for Using the Dataset

    • For geospatial analysis, leverage tools like GeoPandas, QGIS, or Folium to visualize routes and points.
    • Use the GeoJSON files for interactive mapping and spatial queries.
    • Combine this dataset with external datasets (e.g., road networks) for enriched analytics.

    This dataset is designed to empower data scientists, researchers, and business professionals to explore the intersection of geospatial intelligence and supply chain optimization.

  8. c

    Walmart Products Dataset – Free Product Data CSV

    • crawlfeeds.com
    csv, zip
    Updated Dec 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Walmart Products Dataset – Free Product Data CSV [Dataset]. https://crawlfeeds.com/datasets/walmart-products-free-dataset
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Looking for a free Walmart product dataset? The Walmart Products Free Dataset delivers a ready-to-use ecommerce product data CSV containing ~2,100 verified product records from Walmart.com. It includes vital details like product titles, prices, categories, brand info, availability, and descriptions — perfect for data analysis, price comparison, market research, or building machine-learning models.

    Key Features

    Complete Product Metadata: Each entry includes URL, title, brand, SKU, price, currency, description, availability, delivery method, average rating, total ratings, image links, unique ID, and timestamp.

    CSV Format, Ready to Use: Download instantly - no need for scraping, cleaning or formatting.

    Good for E-commerce Research & ML: Ideal for product cataloging, price tracking, demand forecasting, recommendation systems, or data-driven projects.

    Free & Easy Access: Priced at USD $0.0, making it a great starting point for developers, data analysts or students.

    Who Benefits?

    • Data analysts & researchers exploring e-commerce trends or product catalog data.
    • Developers & data scientists building price-comparison tools, recommendation engines or ML models.
    • E-commerce strategists/marketers need product metadata for competitive analysis or market research.
    • Students/hobbyists needing a free dataset for learning or demo projects.

    Why Use This Dataset Instead of Manual Scraping?

    • Time-saving: No need to write scrapers or deal with rate limits.
    • Clean, structured data: All records are verified and already formatted in CSV, saving hours of cleaning.
    • Risk-free: Avoid Terms-of-Service issues or IP blocks that come with manual scraping.
      Instant access: Free and immediately downloadable.
  9. E

    Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54257
    Explore at:
    ppt, doc, pdfAvailable download formats
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing need for businesses to derive actionable insights from their ever-expanding datasets. The market, currently estimated at $15 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching an estimated $45 billion by 2033. This growth is fueled by several factors, including the rising adoption of big data analytics, the proliferation of cloud-based solutions offering enhanced accessibility and scalability, and the growing demand for data-driven decision-making across diverse industries like finance, healthcare, and retail. The market is segmented by application (large enterprises and SMEs) and type (graphical and non-graphical tools), with graphical tools currently holding a larger market share due to their user-friendly interfaces and ability to effectively communicate complex data patterns. Large enterprises are currently the dominant segment, but the SME segment is anticipated to experience faster growth due to increasing affordability and accessibility of EDA solutions. Geographic expansion is another key driver, with North America currently holding the largest market share due to early adoption and a strong technological ecosystem. However, regions like Asia-Pacific are exhibiting high growth potential, fueled by rapid digitalization and a burgeoning data science talent pool. Despite these opportunities, the market faces certain restraints, including the complexity of some EDA tools requiring specialized skills and the challenge of integrating EDA tools with existing business intelligence platforms. Nonetheless, the overall market outlook for EDA tools remains highly positive, driven by ongoing technological advancements and the increasing importance of data analytics across all sectors. The competition among established players like IBM Cognos Analytics and Altair RapidMiner, and emerging innovative companies like Polymer Search and KNIME, further fuels market dynamism and innovation.

  10. m

    The RHMCD-20 datasets for Depression and Mental Health Data Analysis with...

    • data.mendeley.com
    Updated Dec 18, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Imrus Salehin (2023). The RHMCD-20 datasets for Depression and Mental Health Data Analysis with Machine Learning [Dataset]. http://doi.org/10.17632/pxjmjyfdh2.1
    Explore at:
    Dataset updated
    Dec 18, 2023
    Authors
    Imrus Salehin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    the RHMCD-20 dataset, we took care to include information from a wide range of sources, including teenagers from Bangladesh, college students, housewives, professionals from businesses and corporations, and other people.This is survey data for Depression and Mental Health Data Analysis. Survey questions : Age: Represents the age of the participants. Gender: Indicates the gender of the participants. Occupation: Represents the participant's occupations. Days_Indoors :Indicates the number of days the participant has not been out of the house Growing_Stress: Indicates the participant's stress is increasing day by day (Yes/No). Quarantine_Frustration: Frustrations in the first two weeks of quarantine (Yes/Maybe/No). Changes_Habits: Represents major changes in eating habits and sleeping (Yes/Maybe/No). Mental_Health_History : A precedent of mental disorders in the previous generation (Yes/No). Weight_Change :Highlights changes in body weight during quarantine (Yes/Maybe/No) Mood_Swings: Represents extreme mood changes (Low/Medium/High). Coping_Struggles: The inability to cope with daily problems or stress (Yes/Maybe/No). Work_Interest :Represents whether the participant is losing interest in working (Yes/No). Social_Weakness :Conveys feeling mentally weak when interacting with others (Yes/No).

  11. A

    Artificial Intelligence Training Dataset Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.archivemarketresearch.com/reports/artificial-intelligence-training-dataset-38645
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is projected to reach $1605.2 million by 2033, exhibiting a CAGR of 9.4% from 2025 to 2033. The surge in demand for AI training datasets is driven by the increasing adoption of AI and machine learning technologies in various industries such as healthcare, financial services, and manufacturing. Moreover, the growing need for reliable and high-quality data for training AI models is further fueling the market growth. Key market trends include the increasing adoption of cloud-based AI training datasets, the emergence of synthetic data generation, and the growing focus on data privacy and security. The market is segmented by type (image classification dataset, voice recognition dataset, natural language processing dataset, object detection dataset, and others) and application (smart campus, smart medical, autopilot, smart home, and others). North America is the largest regional market, followed by Europe and Asia Pacific. Key companies operating in the market include Appen, Speechocean, TELUS International, Summa Linguae Technologies, and Scale AI. Artificial Intelligence (AI) training datasets are critical for developing and deploying AI models. These datasets provide the data that AI models need to learn, and the quality of the data directly impacts the performance of the model. The AI training dataset market landscape is complex, with many different providers offering datasets for a variety of applications. The market is also rapidly evolving, as new technologies and techniques are developed for collecting, labeling, and managing AI training data.

  12. d

    Data from: Inferring complex phylogenies using parsimony: an empirical...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Douglas E. Soltis; Pamela S. Soltis; Mark E. Mort; Mark W. Chase; Vincent Savolainen; Sara B. Hoot; Cynthia M. Morton (2025). Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms [Dataset]. http://doi.org/10.5061/dryad.64
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Douglas E. Soltis; Pamela S. Soltis; Mark E. Mort; Mark W. Chase; Vincent Savolainen; Sara B. Hoot; Cynthia M. Morton
    Time period covered
    Feb 22, 2008
    Description

    To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbc L (1,428 bp), and atp B (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA rbc L atp B data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a ...

  13. f

    Assessment and Improvement of Statistical Tools for Comparative Proteomics...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen (2023). Assessment and Improvement of Statistical Tools for Comparative Proteomics Analysis of Sparse Data Sets with Few Experimental Replicates [Dataset]. http://doi.org/10.1021/pr400045u.s002
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Large-scale quantitative analyses of biological systems are often performed with few replicate experiments, leading to multiple nonidentical data sets due to missing values. For example, mass spectrometry driven proteomics experiments are frequently performed with few biological or technical replicates due to sample-scarcity or due to duty-cycle or sensitivity constraints, or limited capacity of the available instrumentation, leading to incomplete results where detection of significant feature changes becomes a challenge. This problem is further exacerbated for the detection of significant changes on the peptide level, for example, in phospho-proteomics experiments. In order to assess the extent of this problem and the implications for large-scale proteome analysis, we investigated and optimized the performance of three statistical approaches by using simulated and experimental data sets with varying numbers of missing values. We applied three tools, including standard t test, moderated t test, also known as limma, and rank products for the detection of significantly changing features in simulated and experimental proteomics data sets with missing values. The rank product method was improved to work with data sets containing missing values. Extensive analysis of simulated and experimental data sets revealed that the performance of the statistical analysis tools depended on simple properties of the data sets. High-confidence results were obtained by using the limma and rank products methods for analyses of triplicate data sets that exhibited more than 1000 features and more than 50% missing values. The maximum number of differentially represented features was identified by using limma and rank products methods in a complementary manner. We therefore recommend combined usage of these methods as a novel and optimal way to detect significantly changing features in these data sets. This approach is suitable for large quantitative data sets from stable isotope labeling and mass spectrometry experiments and should be applicable to large data sets of any type. An R script that implements the improved rank products algorithm and the combined analysis is available.

  14. c

    Flipkart reviews large dataset

    • crawlfeeds.com
    csv, zip
    Updated Mar 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Flipkart reviews large dataset [Dataset]. https://crawlfeeds.com/datasets/flipkart-reviews-large-dataset
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Flipkart Reviews Large Dataset is a comprehensive collection of 1.86 million customer reviews from Flipkart, one of India's largest e-commerce platforms. Available in CSV format, this dataset is ideal for conducting sentiment analysis, understanding consumer preferences, and developing machine learning models.

    For a more extensive dataset, consider the Flipkart E-commerce Dataset, which offers detailed information on over 5.7 million products, including names, descriptions, prices, customer reviews, ratings, and images. This dataset is invaluable for data analysis, machine learning projects, and in-depth market research.

    Whether you're looking to enhance recommendation systems, perform market research, or analyze customer feedback trends, this dataset offers a wealth of information.

    Use Cases:

    • Sentiment Analysis: Analyze customer sentiments to understand product reception.
    • Recommendation Systems: Build models to recommend products based on customer feedback.
    • Consumer Behavior Analysis: Study purchasing patterns and preferences across different product categories.
    • Market Research: Gain insights into market trends and customer opinions for various products.
  15. n

    DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

    • narcis.nl
    • data.mendeley.com
    • +1more
    Updated Mar 13, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Constante, F (via Mendeley Data) (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.5
    Explore at:
    Dataset updated
    Mar 13, 2019
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Constante, F (via Mendeley Data)
    Description

    A DataSet of Supply Chains used by the company DataCo Global was used for the analysis. Dataset of Supply Chain , which allows the use of Machine Learning Algorithms and R Software. Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.It also allows the correlation of Structured Data with Unstructured Data for knowledge generation.

    Type Data : Structured Data : DataCoSupplyChainDataset.csv Unstructured Data : tokenized_access_logs.csv (Clickstream)

    Types of Products : Clothing , Sports , and Electronic Supplies

    Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.

  16. N

    Comprehensive Median Household Income and Distribution Dataset for Big...

    • neilsberg.com
    Updated Jan 11, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neilsberg Research (2024). Comprehensive Median Household Income and Distribution Dataset for Big Sandy, TX: Analysis by Household Type, Size and Income Brackets [Dataset]. https://www.neilsberg.com/research/datasets/cd8b6e24-b041-11ee-aaca-3860777c1fe6/
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Texas, Big Sandy
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the median household income in Big Sandy. It can be utilized to understand the trend in median household income and to analyze the income distribution in Big Sandy by household type, size, and across various income brackets.

    Content

    The dataset will have the following datasets when applicable

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Big Sandy, TX Median Household Income Trends (2010-2021, in 2022 inflation-adjusted dollars)
    • Median Household Income Variation by Family Size in Big Sandy, TX: Comparative analysis across 7 household sizes
    • Income Distribution by Quintile: Mean Household Income in Big Sandy, TX
    • Big Sandy, TX households by income brackets: family, non-family, and total, in 2022 inflation-adjusted dollars

    Good to know

    Margin of Error

    Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

    Custom data

    If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Big Sandy median household income. You can refer the same here

  17. Cloud Analytics Market Analysis North America, Europe, APAC, Middle East and...

    • technavio.com
    pdf
    Updated Jul 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio (2024). Cloud Analytics Market Analysis North America, Europe, APAC, Middle East and Africa, South America - US, China, UK, Germany, Japan - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/cloud-analytics-market-industry-analysis
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice

    Time period covered
    2024 - 2028
    Description

    Snapshot img

    Cloud Analytics Market Size 2024-2028

    The cloud analytics market size is forecast to increase by USD 74.08 billion at a CAGR of 24.4% between 2023 and 2028.

    The market is experiencing significant growth due to several key trends. The adoption of hybrid and multi-cloud setups is on the rise, as these configurations enhance data connectivity and flexibility. Another trend driving market growth is the increasing use of cloud security applications to safeguard sensitive data.
    However, concerns regarding confidential data security and privacy remain a challenge for market growth. Organizations must ensure robust security measures are in place to mitigate risks and maintain trust with their customers. Overall, the market is poised for continued expansion as businesses seek to leverage the benefits of cloud technologies for data processing and data analytics.
    

    What will be the Size of the Cloud Analytics Market During the Forecast Period?

    Request Free Sample

    The market is experiencing significant growth due to the increasing volume of data generated by businesses and the demand for advanced analytics solutions. Cloud-based analytics enables organizations to process and analyze large datasets from various data sources, including unstructured data, in real-time. This is crucial for businesses looking to make data-driven decisions and gain valuable insights to optimize their operations and meet customer requirements. Key industries such as sales and marketing, customer service, and finance are adopting cloud analytics to improve key performance indicators and gain a competitive edge. Both Small and Medium-sized Enterprises (SMEs) and large enterprises are embracing cloud analytics, with solutions available on private, public, and multi-cloud platforms.
    Big data technology, such as machine learning and artificial intelligence, are integral to cloud analytics, enabling advanced data analytics and business intelligence. Cloud analytics provides businesses with the flexibility to store and process data In the cloud, reducing the need for expensive on-premises data storage and computation. Hybrid environments are also gaining popularity, allowing businesses to leverage the benefits of both private and public clouds. Overall, the market is poised for continued growth as businesses increasingly rely on data-driven insights to inform their decision-making processes.
    

    How is this Cloud Analytics Industry segmented and which is the largest segment?

    The cloud analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2017-2022 for the following segments.

    Solution
    
      Hosted data warehouse solutions
      Cloud BI tools
      Complex event processing
      Others
    
    
    Deployment
    
      Public cloud
      Hybrid cloud
      Private cloud
    
    
    Geography
    
      North America
    
        US
    
    
      Europe
    
        Germany
        UK
    
    
      APAC
    
        China
        Japan
    
    
      Middle East and Africa
    
    
    
      South America
    

    By Solution Insights

    The hosted data warehouse solutions segment is estimated to witness significant growth during the forecast period.
    

    Hosted data warehouses enable organizations to centralize and analyze large datasets from multiple sources, facilitating advanced analytics solutions and real-time insights. By utilizing cloud-based infrastructure, businesses can reduce operational costs through eliminating licensing expenses, hardware investments, and maintenance fees. Additionally, cloud solutions offer network security measures, such as Software Defined Networking and Network integration, ensuring data protection. Cloud analytics caters to diverse industries, including SMEs and large enterprises, addressing requirements for sales and marketing, customer service, and key performance indicators. Advanced analytics capabilities, including predictive analytics, automated decision making, and fraud prevention, are essential for data-driven decision making and business optimization.

    Furthermore, cloud platforms provide access to specialized talent, big data technology, and AI, enhancing customer experiences and digital business opportunities. Data connectivity and data processing in real-time are crucial for network agility and application performance. Hosted data warehouses offer computational power and storage capabilities, ensuring efficient data utilization and enterprise information management. Cloud service providers offer various cloud environments, including private, public, multi-cloud, and hybrid, catering to diverse business needs. Compliance and security concerns are addressed through cybersecurity frameworks and data security measures, ensuring data breaches and thefts are minimized.

    Get a glance at the Cloud Analytics Industry report of share of various segments Request Free Sample

    The Hosted data warehouse solutions s

  18. N

    Comprehensive Median Household Income and Distribution Dataset for Big Bend...

    • neilsberg.com
    Updated Jan 11, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neilsberg Research (2024). Comprehensive Median Household Income and Distribution Dataset for Big Bend Town, Wisconsin: Analysis by Household Type, Size and Income Brackets [Dataset]. https://www.neilsberg.com/research/datasets/cd8b5c39-b041-11ee-aaca-3860777c1fe6/
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Big Bend, Wisconsin
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the median household income in Big Bend town. It can be utilized to understand the trend in median household income and to analyze the income distribution in Big Bend town by household type, size, and across various income brackets.

    Content

    The dataset will have the following datasets when applicable

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Big Bend Town, Wisconsin Median Household Income Trends (2010-2021, in 2022 inflation-adjusted dollars)
    • Median Household Income Variation by Family Size in Big Bend Town, Wisconsin: Comparative analysis across 7 household sizes
    • Income Distribution by Quintile: Mean Household Income in Big Bend Town, Wisconsin
    • Big Bend Town, Wisconsin households by income brackets: family, non-family, and total, in 2022 inflation-adjusted dollars

    Good to know

    Margin of Error

    Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

    Custom data

    If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Big Bend town median household income. You can refer the same here

  19. Big data and business analytics revenue worldwide 2015-2022

    • statista.com
    Updated Aug 17, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2021). Big data and business analytics revenue worldwide 2015-2022 [Dataset]. https://www.statista.com/statistics/551501/worldwide-big-data-business-analytics-revenue/
    Explore at:
    Dataset updated
    Aug 17, 2021
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    Worldwide
    Description

    The global big data and business analytics (BDA) market was valued at ***** billion U.S. dollars in 2018 and is forecast to grow to ***** billion U.S. dollars by 2021. In 2021, more than half of BDA spending will go towards services. IT services is projected to make up around ** billion U.S. dollars, and business services will account for the remainder. Big data High volume, high velocity and high variety: one or more of these characteristics is used to define big data, the kind of data sets that are too large or too complex for traditional data processing applications. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets. For example, connected IoT devices are projected to generate **** ZBs of data in 2025. Business analytics Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate business insights. The size of the business intelligence and analytics software application market is forecast to reach around **** billion U.S. dollars in 2022. Growth in this market is driven by a focus on digital transformation, a demand for data visualization dashboards, and an increased adoption of cloud.

  20. Z

    Cloud-based User Entity Behavior Analytics Log Data Set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Landauer, Max; Skopik, Florian; Höld, Georg; Wurzenberger, Markus (2023). Cloud-based User Entity Behavior Analytics Log Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119952
    Explore at:
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    AIT Austrian Institute of Technology
    Authors
    Landauer, Max; Skopik, Florian; Höld, Georg; Wurzenberger, Markus
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This respository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users utilizing a cloud storage suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5000 distinct users in more than five years (2017-07-07 to 2022-09-29 or 1910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1]. The logs are provided in JSON format with the following attributes in the first level:

    id: Unique log line identifier that starts at 1 and increases incrementally, e.g., 1. time: Time stamp of the event in ISO format, e.g., 2021-01-01T00:00:02Z. uid: Unique anonymized identifier for the user generating the event, e.g., old-pink-crane-sharedealer. uidType: Specifier for uid, which is either the user name or IP address for logged out users. type: The action carried out by the user, e.g., file_accessed. params: Additional event parameters (e.g., paths, groups) stored in a nested dictionary. isLocalIP: Optional flag for event origin, which is either internal (true) or external (false). role: Optional user role: consulting, administration, management, sales, technical, or external. location: Optional IP-based geolocation of event origin, including city, country, longitude, latitude, etc. In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:

    {"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}

    {"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"} The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers. The data set comprises only normal user behavior, but can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes of behavior patterns, and a sample detection approach in our github repo. Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set. If you use the dataset, please cite the following publication: [1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger. "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing". 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE. [PDF]

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bharat Kumar0925 (2024). TMDB movies clean dataset [Dataset]. https://www.kaggle.com/datasets/bharatkumar0925/tmdb-movies-clean-dataset
Organization logo

TMDB movies clean dataset

Large clean data of TMDB movies

Explore at:
zip(266877093 bytes)Available download formats
Dataset updated
Sep 6, 2024
Authors
Bharat Kumar0925
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Dataset Description

This dataset contains two files: Large_movies_data.csv and large_movies_clean.csv. The data is taken from the TMDB dataset. Originally, it contained around 900,000 movies, but some movies were dropped for recommendation purposes. Specifically, movies missing an overview were removed since the overview is one of the most important columns for analysis.

Column Description:

Large_movies_data.csv:

  • Id: Unique identifier for each movie.
  • Title: The title of the movie.
  • Overview: A brief description of the movie.
  • Genres: The genres associated with the movie.
  • Cast: The main actors in the movie.
  • Director: The director of the movie.
  • Writers: The screenwriters of the movie.
  • Production_companies: Companies involved in producing the movie.
  • Producers: Producers of the movie.
  • Original_language: The original language of the movie.
  • Vote_count: Number of votes the movie has received.
  • Vote_average: Average rating based on user votes.
  • Popularity: Popularity score of the movie.
  • Runtime: Duration of the movie in minutes.
  • Release_date: The release date of the movie.

Total movies in Large_movies_data.csv: 663,828.

Large_movies_clean.csv:

This file is a cleaned version with unnecessary columns removed, text converted to lowercase, and many symbols removed (though some may still remain). If you find that certain features are missing, you can use the original Large_movies_data.csv.

Columns in large_movies_clean.csv: - Id: Unique identifier for each movie. - Title: The title of the movie. - Tags: Combined information from the overview, genres, and other textual columns. - Original_language: The original language of the movie. - Vote_count: Number of votes the movie has received. - Vote_average: Average rating based on user votes. - Year: Year extracted from the release date. - Month: Month extracted from the release date.

Possible Use Cases:

  1. Recommendation System: A robust recommendation system can be built using this large dataset.
  2. Analysis: Analyze various aspects, such as identifying actors who starred in the most popular movies, the impact of having the same writer, director, and producer on a movie, and whether independent producers create better movies.
  3. Rating Prediction: Predict the average rating of a movie based on factors such as overview, genres, and cast.
  4. Other Analysis: Perform other types of analysis to discover patterns in the movie industry.

If you find this dataset useful, please upvote it!

Search
Clear search
Close search
Google apps
Main menu