77 datasets found
  1. Human Resources Data Set

    • kaggle.com
    Updated Oct 19, 2020
    Cite
    Dr. Rich (2020). Human Resources Data Set [Dataset]. http://doi.org/10.34740/kaggle/dsv/1572001
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dr. Rich
    Description

    Updated 30 January 2023

    Version 14 of Dataset

    License Update:

    There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the original authors of this dataset.

    We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing, please follow this license:

    CC BY-NC-ND: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

    Codebook

    https://rpubs.com/rhuebner/hrd_cb_v14

    PLEASE NOTE: I recently updated the codebook; please use the above link. A few minor discrepancies were identified between the codebook and the dataset. Please feel free to contact me through LinkedIn (www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.

    Context

    HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business. We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in Tableau Desktop - a data visualization tool that's easy to learn.

    This version provides a variety of features that are useful for both data visualization AND creating machine learning / predictive analytics models. We are working on expanding the data set even further by generating even more records and a few additional features. We will be keeping this as one file/one data set for now. There is a possibility of creating a second file perhaps down the road where you can join the files together to practice SQL/joins, etc.

    Note that this dataset isn't perfect; some issues are present by design. It is primarily a teaching data set, created to teach human resources professionals how to work with data and analytics.

    Content

    We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.

    Recent additions to the data include:

    • Absences
    • Most Recent Performance Review Date
    • Employee Engagement Score

    Acknowledgements

    Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over 200 Human Resource Management students at the college. Students in the course learn data visualization techniques with Tableau Desktop and use this data set to complete a series of assignments.

    Inspiration

    We've included some open-ended questions that you can explore and try to address through creating Tableau visualizations, or R or Python analyses. Good luck and enjoy the learning!

    • Is there any relationship between who a person works for and their performance score?
    • What is the overall diversity profile of the organization?
    • What are our best recruiting sources if we want to ensure a diverse organization?
    • Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
    • Are there areas of the company where pay is not equitable?
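
    As a starting point for the termination-prediction question, here is a minimal logistic-regression baseline. The feature names (pay rate, engagement score, absences) mirror columns mentioned in the description, but the data below is a synthetic stand-in, not the actual HR file.

```python
# Minimal sketch of a termination-prediction baseline.
# PayRate, EngagementSurvey, and Absences mirror columns named in the
# description; the data here is synthetic, not the real HR dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
pay_rate = rng.normal(25, 8, n)
engagement = rng.uniform(1, 5, n)
absences = rng.poisson(5, n)

# Synthetic label: low engagement and high absences raise termination odds.
logits = -1.0 - 0.6 * engagement + 0.3 * absences
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X = np.column_stack([pay_rate, engagement, absences])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

    The same few lines apply to the real CSV once the actual column names are substituted in.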

    Many other interesting questions could be addressed through this data set. Dr. Patalano and I look forward to seeing what we can come up with.

    If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn: http://www.linkedin.com/in/RichHuebner

    You can also reach me via email at: Richard.Huebner@go.cambridgecollege.edu

  2. Table_1_Raw Data Visualization for Common Factorial Designs Using SPSS: A...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 15, 2023
    + more versions
    Cite
    Florian Loffing (2023). Table_1_Raw Data Visualization for Common Factorial Designs Using SPSS: A Syntax Collection and Tutorial.XLSX [Dataset]. http://doi.org/10.3389/fpsyg.2022.808469.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Frontiers
    Authors
    Florian Loffing
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.
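
    The summary statistics the syntax overlays on raw data (mean, SD, 95% CI) are straightforward to reproduce outside SPSS as well. Below is a rough Python sketch for a one-factor between-subject design with made-up values; this is illustrative only, not the paper's SPSS code.

```python
# Per-group mean, SD, and a 95% CI (normal approximation) - the measures
# the SPSS syntax collection displays alongside the raw data points.
import math

groups = {
    "control":   [4.1, 5.3, 4.8, 5.0, 4.6, 5.2],
    "treatment": [5.9, 6.4, 6.1, 5.7, 6.3, 6.0],
}

summary = {}
for name, values in groups.items():
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)  # 95% CI half-width
    summary[name] = (mean, sd, mean - half, mean + half)

for name, (mean, sd, lo, hi) in summary.items():
    print(f"{name}: mean={mean:.2f} sd={sd:.2f} 95% CI=[{lo:.2f}, {hi:.2f}]")
```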

  3. Customer360Insights

    • kaggle.com
    Updated Jun 9, 2024
    Cite
    Dave Darshan (2024). Customer360Insights [Dataset]. https://www.kaggle.com/datasets/davedarshan/customer360insights
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dave Darshan
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Customer360Insights

    The Customer360Insights dataset is a synthetic collection meticulously designed to mirror the multifaceted nature of customer interactions within an e-commerce platform. It encompasses a wide array of variables, each serving as a pillar to support various analytical explorations. Here’s a breakdown of the dataset and the potential analyses it enables:

    Dataset Description

    • Customer Demographics: Includes FullName, Gender, Age, CreditScore, and MonthlyIncome. These variables provide a demographic snapshot of the customer base, allowing for segmentation and targeted marketing analysis.
    • Geographical Data: Comprising Country, State, and City, this section facilitates location-based analytics, market penetration studies, and regional sales performance.
    • Product Information: Details like Category, Product, Cost, and Price enable product trend analysis, profitability assessment, and inventory optimization.
    • Transactional Data: Captures the customer journey through SessionStart, CartAdditionTime, OrderConfirmation, OrderConfirmationTime, PaymentMethod, and SessionEnd. This rich temporal data can be used for funnel analysis, conversion rate optimization, and customer behavior modeling.
    • Post-Purchase Details: With OrderReturn and ReturnReason, analysts can delve into return rate calculations, post-purchase satisfaction, and quality control.

    Types of Analysis

    • Descriptive Analytics: Understand basic metrics like average monthly income, most common product categories, and typical credit scores.
    • Predictive Analytics: Use machine learning to predict credit risk or the likelihood of a purchase based on demographics and session activity.
    • Customer Segmentation: Group customers by demographics or purchasing behavior to tailor marketing strategies.
    • Geospatial Analysis: Examine sales distribution across different regions and optimize logistics.
    • Time Series Analysis: Study the seasonality of purchases and session activities over time.
    • Funnel Analysis: Evaluate the customer journey from session start to order confirmation and identify drop-off points.
    • Cohort Analysis: Track customer cohorts over time to understand retention and repeat purchase patterns.
    • Market Basket Analysis: Discover product affinities and develop cross-selling strategies.
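
    As a tiny illustration of the funnel analysis above, using the timestamp columns named in the dataset description and assuming that a stage the customer never reached is stored as a missing value (an assumption for this sketch, not a documented convention):

```python
# Count how many sessions reach each funnel stage; NaT marks a stage
# the customer never reached (assumed convention for this sketch).
import pandas as pd

df = pd.DataFrame({
    "SessionStart": pd.to_datetime(["2024-06-01 10:00"] * 5),
    "CartAdditionTime": pd.to_datetime(
        ["2024-06-01 10:05", "2024-06-01 10:07", None, "2024-06-01 10:09", None]),
    "OrderConfirmationTime": pd.to_datetime(
        ["2024-06-01 10:12", None, None, "2024-06-01 10:15", None]),
})

funnel = {
    "sessions": len(df),
    "added_to_cart": int(df["CartAdditionTime"].notna().sum()),
    "confirmed": int(df["OrderConfirmationTime"].notna().sum()),
}
print(funnel)  # {'sessions': 5, 'added_to_cart': 3, 'confirmed': 2}
```

    Dividing successive stage counts gives the conversion rate at each step and makes drop-off points visible.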

    This dataset is a playground for data enthusiasts to practice cleaning, transforming, visualizing, and modeling data. Whether you’re conducting A/B testing for marketing campaigns, forecasting sales, or building customer profiles, Customer360Insights offers a rich, realistic dataset for honing your data science skills.

    Curious about how I created the data? Feel free to click here and take a peek! 😉

    📊🔍 Good Luck and Happy Analysing 🔍📊

  4. Dataplex: All CMS Data Feeds | Access 1519 Reports & 26B+ Rows of Data |...

    • datarade.ai
    .csv
    Updated Aug 14, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: All CMS Data Feeds | Access 1519 Reports & 26B+ Rows of Data | Perfect for Historical Analysis & Easy Ingestion [Dataset]. https://datarade.ai/data-products/dataplex-all-cms-data-feeds-access-1519-reports-26b-row-dataplex
    Explore at:
    Available download formats: .csv
    Dataset updated
    Aug 14, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    United States of America
    Description

    The All CMS Data Feeds dataset is an expansive resource offering access to 118 unique report feeds, providing in-depth insights into various aspects of the U.S. healthcare system. With over 25.8 billion rows of data meticulously collected since 2007, this dataset is invaluable for healthcare professionals, analysts, researchers, and businesses seeking to understand and analyze healthcare trends, performance metrics, and demographic shifts over time. The dataset is updated monthly, ensuring that users always have access to the most current and relevant data available.

    Dataset Overview:

    118 Report Feeds:

    • The dataset includes a wide array of report feeds, each providing unique insights into a different dimension of healthcare. Topics include Medicare and Medicaid service metrics, patient demographics, provider information, financial data, and much more. The breadth of information ensures that users can find relevant data for nearly any healthcare-related analysis.
    • As CMS releases new report feeds, they are automatically added to this dataset, keeping it current and expanding its utility for users.

    25.8 Billion Rows of Data:

    • With over 25.8 billion rows of data, this dataset provides a comprehensive view of the U.S. healthcare system. This extensive volume of data allows for granular analysis, enabling users to uncover insights that might be missed in smaller datasets. The data is also meticulously cleaned and aligned, ensuring accuracy and ease of use.

    Historical Data Since 2007:

    • The dataset spans from 2007 to the present, offering a rich historical perspective that is essential for tracking long-term trends and changes in healthcare delivery, policy impacts, and patient outcomes. This historical data is particularly valuable for conducting longitudinal studies and evaluating the effects of various healthcare interventions over time.

    Monthly Updates:

    • To ensure that users have access to the most current information, the dataset is updated monthly. These updates include new reports as well as revisions to existing data, making the dataset a continuously evolving resource that stays relevant and accurate.

    Data Sourced from CMS:

    • The data in this dataset is sourced directly from the Centers for Medicare & Medicaid Services (CMS). After collection, the data is meticulously cleaned and its attributes are aligned, ensuring consistency, accuracy, and ease of use for any application. Furthermore, any new updates or releases from CMS are automatically integrated into the dataset, keeping it comprehensive and current.

    Use Cases:

    Market Analysis:

    • The dataset is ideal for market analysts who need to understand the dynamics of the healthcare industry. The extensive historical data allows for detailed segmentation and analysis, helping users identify trends, market shifts, and growth opportunities. The comprehensive nature of the data enables users to perform in-depth analyses of specific market segments, making it a valuable tool for strategic decision-making.

    Healthcare Research:

    • Researchers will find the All CMS Data Feeds dataset to be a robust foundation for academic and commercial research. The historical data, combined with the breadth of coverage across various healthcare metrics, supports rigorous, in-depth analysis. Researchers can explore the effects of healthcare policies, study patient outcomes, analyze provider performance, and more, all within a single, comprehensive dataset.

    Performance Tracking:

    • Healthcare providers and organizations can use the dataset to track performance metrics over time. By comparing data across different periods, organizations can identify areas for improvement, monitor the effectiveness of initiatives, and ensure compliance with regulatory standards. The dataset provides the detailed, reliable data needed to track and analyze key performance indicators.

    Compliance and Regulatory Reporting:

    • The dataset is also an essential tool for compliance officers and those involved in regulatory reporting. With detailed data on provider performance, patient outcomes, and healthcare utilization, the dataset helps organizations meet regulatory requirements, prepare for audits, and ensure adherence to best practices. The accuracy and comprehensiveness of the data make it a trusted resource for regulatory compliance.

    Data Quality and Reliability:

    The All CMS Data Feeds dataset is designed with a strong emphasis on data quality and reliability. Each row of data is meticulously cleaned and aligned, ensuring that it is both accurate and consistent. This attention to detail makes the dataset a trusted resource for high-stakes applications, where data quality is critical.

    Integration and Usability:

    Ease of Integration:

    • The dataset is provided in a CSV format, which is widely compatible with most data analysis tools and platforms. This ensures that users can easily integrate the data into their existing wo...
  5. NFL Data (Historic Data Available) - Sports Data, National Football League...

    • datarade.ai
    Updated Sep 26, 2024
    + more versions
    Cite
    APISCRAPY (2024). NFL Data (Historic Data Available) - Sports Data, National Football League Datasets. Free Trial Available [Dataset]. https://datarade.ai/data-products/nfl-data-historic-data-available-sports-data-national-fo-apiscrapy
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Sep 26, 2024
    Dataset authored and provided by
    APISCRAPY
    Area covered
    Ireland, Iceland, Bosnia and Herzegovina, Norway, Portugal, Lithuania, Italy, Malta, Poland, China
    Description

    Our NFL Data product offers extensive access to historic and current National Football League statistics and results, available in multiple formats. Whether you're a sports analyst, data scientist, fantasy football enthusiast, or a developer building sports-related apps, this dataset provides everything you need to dive deep into NFL performance insights.

    Key Benefits:

    Comprehensive Coverage: Includes historic and real-time data on NFL stats, game results, team performance, player metrics, and more.

    Multiple Formats: Datasets are available in various formats (CSV, JSON, XML) for easy integration into your tools and applications.

    User-Friendly Access: Whether you are an advanced analyst or a beginner, you can easily access and manipulate data to suit your needs.

    Free Trial: Explore the full range of data with our free trial before committing, ensuring the product meets your expectations.

    Customizable: Filter and download only the data you need, tailored to specific seasons, teams, or players.

    API Access: Developers can integrate real-time NFL data into their apps with API support, allowing seamless updates and user engagement.

    Use Cases:

    Fantasy Football Players: Use the data to analyze player performance, helping to draft winning teams and make better game-day decisions.

    Sports Analysts: Dive deep into historical and current NFL stats for research, articles, and game predictions.

    Developers: Build custom sports apps and dashboards by integrating NFL data directly through API access.

    Betting & Prediction Models: Use data to create accurate predictions for NFL games, helping sportsbooks and bettors alike.

    Media Outlets: Enhance game previews, post-game analysis, and highlight reels with accurate, detailed NFL stats.

    Our NFL Data product ensures you have the most reliable, up-to-date information to drive your projects, whether it's enhancing user experiences, creating predictive models, or simply enjoying in-depth football analysis.

  6. Financial Statement Data Sets

    • catalog.data.gov
    • s.cnmilf.com
    Updated Jul 9, 2025
    + more versions
    Cite
    Economic and Risk Analysis (2025). Financial Statement Data Sets [Dataset]. https://catalog.data.gov/dataset/financial-statement-data-sets
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Economic and Risk Analysis
    Description

    The data sets below provide selected information extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).

  7. Basic Computer Data

    • kaggle.com
    Updated May 5, 2017
    Cite
    LiamLarsen (2017). Basic Computer Data [Dataset]. https://www.kaggle.com/kingburrito666/basic-computer-data-set/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    LiamLarsen
    Description

    For what?

    This dataset is for basic data analysis. Student statisticians or data analysts (like myself) could use it as a basic learning point. Even ML students could predict future prices and speeds of computers.

    Unfortunately, this dataset doesn't come with dates (which are a pain to work with anyway), but the computers are in order from earliest to latest.

    I will be uploading another version with this and a more detailed CSV that has the computer name, date, and other stats. This dataset is free to use for any purpose.

    This is simply to gain understanding in analyzing data. At least for me.

    Content

    price, speed, hd, ram, screen, cd, multi, premium, ads, trend
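
    A sketch of the kind of price-prediction exercise the author suggests, using three of the listed numeric columns. The values below are synthetic stand-ins, not the actual CSV.

```python
# Fit a linear model for price from speed, ram, and hd (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
speed = rng.choice([25, 33, 50, 66, 100], n).astype(float)
ram = rng.choice([2, 4, 8, 16, 32], n).astype(float)
hd = rng.uniform(80, 2000, n)
price = 800 + 6 * speed + 40 * ram + 0.3 * hd + rng.normal(0, 50, n)

X = np.column_stack([speed, ram, hd])
model = LinearRegression().fit(X, price)
r2 = model.score(X, price)
print(round(r2, 3))  # close to 1 here, since the synthetic noise is low
```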

    Something glorious is coming

    The largest computer CSV? Maybe? Maybe I'm scraping it right now? Who knows? ;)

  8. Udemy Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Cite
    Bright Data (2024). Udemy Dataset [Dataset]. https://brightdata.com/products/datasets/udemy
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    We'll tailor a Udemy dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, demographic data of learners, enrollment numbers, review scores, and other pertinent metrics.

    Leverage our Udemy datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.

    Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.

  9. Dataplex: US Healthcare NPI Data | Access 8.5M B2B Contacts with Emails &...

    • datarade.ai
    .csv, .txt
    Updated Jul 13, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: US Healthcare NPI Data | Access 8.5M B2B Contacts with Emails & Phones | Perfect for Outreach & Market Research [Dataset]. https://datarade.ai/data-products/dataplex-us-healthcare-npi-data-access-8-5m-b2b-contacts-w-dataplex
    Explore at:
    Available download formats: .csv, .txt
    Dataset updated
    Jul 13, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    United States
    Description

    US Healthcare NPI Data is a comprehensive resource offering detailed information on health providers registered in the United States.

    Dataset Highlights:

    • NPI Numbers: Unique identification numbers for health providers.
    • Contact Details: Includes addresses and phone numbers.
    • State License Numbers: State-specific licensing information.
    • Additional Identifiers: Other identifiers related to the providers.
    • Business Names: Names of the provider’s business entities.
    • Taxonomies: Classification of provider types and specialties.

    Taxonomy Data:

    • Includes codes, groupings, and classifications.
    • Facilitates detailed analysis and categorization of providers.

    Data Updates:

    • Weekly Delta Changes: Ensures the dataset is current with the latest changes.
    • Monthly Full Refresh: Comprehensive update to maintain accuracy.

    Use Cases:

    • Market Analysis: Understand the distribution and types of healthcare providers across the US. Analyze market trends and identify potential gaps in healthcare services.
    • Outreach: Create targeted marketing campaigns to reach specific types of healthcare providers. Use contact details for direct outreach and engagement with providers.
    • Research: Conduct in-depth research on healthcare providers and their specialties. Analyze provider attributes to support academic or commercial research projects.
    • Compliance and Verification: Verify provider credentials and compliance with state licensing requirements. Ensure accurate provider information for regulatory and compliance purposes.

    Data Quality and Reliability:

    • The dataset is meticulously curated to ensure high quality and reliability. Regular updates, both weekly and monthly, ensure that users have access to the most current information. The comprehensive nature of the data, combined with its regular updates, makes it a valuable tool for a wide range of applications in the healthcare sector.

    Access and Integration:

    • CSV Format: The dataset is provided in CSV format, making it easy to integrate with various data analysis tools and platforms.
    • Ease of Use: The structured format of the data ensures that it can be easily imported, analyzed, and utilized for various applications without extensive preprocessing.

    Ideal for:

    • Healthcare Professionals: Physicians, nurses, and other healthcare providers who need to verify information about their peers.
    • Analysts: Data analysts and business analysts who require detailed and accurate healthcare provider data for their projects.
    • Businesses: Companies in the healthcare sector looking to understand market dynamics and reach out to providers.
    • Researchers: Academic and commercial researchers conducting studies on healthcare providers and services.

    Why Choose This Dataset?

    • Comprehensive Coverage: Detailed information on millions of healthcare providers across the US.
    • Regular Updates: Weekly and monthly updates ensure that the data remains current and reliable.
    • Ease of Integration: Provided in a user-friendly CSV format for easy integration with your existing systems.
    • Versatility: Suitable for a wide range of applications, from market analysis to compliance and research.

    By leveraging the US Healthcare NPI & Taxonomy Data, users can gain valuable insights into the healthcare landscape, enhance their outreach efforts, and conduct detailed research with confidence in the accuracy and comprehensiveness of the data.

    Summary:

    • This dataset is an invaluable resource for anyone needing detailed and up-to-date information on US healthcare providers. Whether for market analysis, research, outreach, or compliance, the US Healthcare NPI & Taxonomy Data offers the detailed, reliable information needed to achieve your goals.
  10. ‘New Year's Resolutions’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘New Year's Resolutions’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-new-year-s-resolutions-f13b/b75b1cb3/?iid=006-613&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘New Year's Resolutions’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/andrewmvd/new-years-resolutions on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Another year comes to a close, and with it comes an opportunity for new beginnings; New Year's resolutions are a chance to do just that.

    At the same time, in a 2014 report, 35% of participants who failed their New Year's Resolutions admitted they had unrealistic goals, 33% of participants did not keep track of their progress, and 23% forgot about them; about one in 10 respondents claimed they made too many resolutions. [1]

    A 2007 study from the University of Bristol involving 3,000 people showed that 88% of those who set New Year resolutions fail, despite the fact that 52% of the study's participants were confident of success at the beginning. [2]

    With this dataset, containing 5011 tweets of new year's resolutions, you can use the collective knowledge to improve your odds of success in your own resolutions!

    How to use this dataset

    • Apply Topic Modeling or Clustering to Identify Common Goals;
    • Explore New Year's Resolutions and use this knowledge to make your own!

    Note that this dataset uses ; as the delimiter, because the free-text fields contain a variable number of commas.
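
    Reading such a file correctly just means passing the delimiter explicitly. A minimal sketch (the column names here are illustrative, not the dataset's actual header):

```python
# A semicolon-delimited file keeps commas inside free-text fields intact.
import io
import pandas as pd

raw = "resolution_topic;tweet_text\nHealth & Fitness;New year, new me: gym 3x a week\n"
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df.shape)  # (1, 2) - the comma inside the tweet stays in one field
```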

    Highlighted Notebooks

    Acknowledgements

    If you use this dataset in your research, please credit the authors.

    Citation

    CrowdFlower.com [Internet]. Data for Everyone. Available from: https://www.crowdflower.com/data-for-everyone/.

    Sources used in the description

    [1] Hutchison, Michelle (29 December 2014). "Bunch of failures or just optimistic? finder.com.au New Year's Resolution Study shows New Year novelty fizzles fast - finder.com.au". finder.com.au. Retrieved 19 April 2018. [2] Lehrer, Jonah (December 26, 2009). "Blame It on the Brain". The Wall Street Journal. ISSN 0099-9660.

    License

    License was not specified at source, yet data is public and free.

    Splash banner

    Icon by Freepik.

    --- Original source retains full ownership of the source dataset ---

  11. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • acs.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level, the fragment level, improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
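
    One of the evaluated approaches, random-forest-based imputation, can be sketched with scikit-learn's IterativeImputer. This is an illustrative stand-in with toy intensity values, not the authors' actual pipeline.

```python
# Random-forest-based imputation of missing intensities (toy values).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([
    [10.2, 11.1, 9.8],
    [10.4, np.nan, 9.9],
    [10.1, 11.0, np.nan],
    [10.3, 11.2, 9.7],
])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    random_state=0,
    max_iter=5,
)
X_filled = imputer.fit_transform(X)
assert not np.isnan(X_filled).any()  # every missing value now has an estimate
```

    Swapping the estimator (or using a method like BPCA from another package) is how one would compare local- versus global-structure approaches, as the study does.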

  12. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that occurred over a period of time. The retailer will use the results to grow the business and offer customers itemset suggestions, so we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set, for example frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between those items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
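    The three metrics above follow directly from their standard definitions. A minimal Python sketch (the original analysis uses R; the counts here are the toy numbers from the example):

```python
# Support, confidence, and lift for the rule "computer mouse => mouse mat",
# computed from the toy customer counts in the example above.
n_customers = 100
n_mouse = 10   # customers who bought a computer mouse
n_mat = 9      # customers who bought a mouse mat
n_both = 8     # customers who bought both

support = n_both / n_customers                  # P(mouse & mat) = 0.08
confidence = support / (n_mouse / n_customers)  # 0.08 / 0.10 = 0.8
lift = confidence / (n_mat / n_customers)       # 0.8 / 0.09 ≈ 8.9
```

    A lift well above 1 indicates the two items are bought together far more often than independence would predict.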

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Below, I briefly describe each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next, we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
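    The grouping step described above (one itemset per invoice) can be sketched as follows. This is an illustrative Python sketch, not the author's code (the original analysis uses R with arules); the sample rows are hypothetical.

```python
# Group (BillNo, Itemname) rows into one transaction itemset per invoice.
from collections import defaultdict

rows = [  # hypothetical sample rows from the retail data
    ("536365", "WHITE HANGING HEART"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER"),
]

baskets = defaultdict(set)
for bill_no, item in rows:
    baskets[bill_no].add(item)

transactions = list(baskets.values())  # one itemset per invoice
```

    The resulting list of itemsets is the shape of input that an Apriori-style algorithm consumes.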

  13. Data from: Excel Templates: A Helpful Tool for Teaching Statistics

    • tandf.figshare.com
    zip
    Updated May 30, 2023
    Alejandro Quintela-del-Río; Mario Francisco-Fernández (2023). Excel Templates: A Helpful Tool for Teaching Statistics [Dataset]. http://doi.org/10.6084/m9.figshare.3408052.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Alejandro Quintela-del-Río; Mario Francisco-Fernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article describes a free, open-source collection of templates for the popular Excel (2013, and later versions) spreadsheet program. These templates are spreadsheet files that allow easy and intuitive learning and the implementation of practical examples concerning descriptive statistics, random variables, confidence intervals, and hypothesis testing. Although they are designed to be used with Excel, they can also be employed with other free spreadsheet programs (changing some particular formulas). Moreover, we exploit some possibilities of the ActiveX controls of the Excel Developer Menu to perform interactive Gaussian density charts. Finally, it is important to note that they can often be embedded in a web page, so it is not necessary to employ Excel software for their use. These templates have been designed as a useful tool to teach basic statistics and to carry out data analysis even when the students are not familiar with Excel. Additionally, they can be used as a complement to other analytical software packages. They aim to assist students in learning statistics, within an intuitive working environment. Supplementary materials with the Excel templates are available online.

  14. Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Updated Aug 7, 2024
    + more versions
    Dataplex (2024). Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-consumer-behavior-data-2-1m-subred-dataplex
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Tunisia, Saint Barthélemy, Cuba, Belize, Cocos (Keeling) Islands, Togo, Burkina Faso, Croatia, Netherlands, Lithuania
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore consumer behavior data of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conducting acade...

  15. Diabetes Dataset

    • data.mendeley.com
    Updated Jul 18, 2020
    Ahlam Rashid (2020). Diabetes Dataset [Dataset]. http://doi.org/10.17632/wj9rwkp9c2.1
    Explore at:
    Dataset updated
    Jul 18, 2020
    Authors
    Ahlam Rashid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of the diabetes dataset is explained here. The data were collected from Iraqi society: they were acquired from the laboratory of Medical City Hospital and the Specialized Center for Endocrinology and Diabetes, Al-Kindy Teaching Hospital. Patients' files were taken, and the data extracted from them were entered into a database to construct the diabetes dataset. The data consist of medical information and laboratory analysis results. The attributes initially entered into the system are: No. of Patient, Blood Sugar Level, Age, Gender, Creatinine ratio (Cr), Body Mass Index (BMI), Urea, Cholesterol (Chol), fasting lipid profile (including total, LDL, VLDL, Triglycerides (TG), and HDL cholesterol), HBA1C, and Class (the patient's diabetes class, which may be Diabetic, Non-Diabetic, or Predict-Diabetic).

  16. Dataset to run examples in SmartPLS 3 (teaching and learning)

    • data.mendeley.com
    • narcis.nl
    Updated Mar 7, 2019
    + more versions
    Diógenes de Bido (2019). Dataset to run examples in SmartPLS 3 (teaching and learning) [Dataset]. http://doi.org/10.17632/4tkph3mxp9.2
    Explore at:
    Dataset updated
    Mar 7, 2019
    Authors
    Diógenes de Bido
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This zip file contains:

    • 3 .zip files = projects to be imported into SmartPLS 3:
      • DLOQ-A model with 7 dimensions
      • DLOQ-A model with a second-order latent variable
      • ECSI model (Tenenhaus et al., 2005) to exemplify direct, indirect, and total effects, as well as the importance-performance map and moderation with continuous variables
      • ECSI model (Sanches, 2013) to exemplify MGA (multi-group analysis)
    • 5 files (csv, txt) with data to run 7 examples in SmartPLS 3

    Note:
    • DLOQ-A = new dataset (ours)
    • ECSI-Tenenhaus et al. [model for mediation and moderation] = available at http://www.smartpls.com > Resources > SmartPLS Project Examples
    • ECSI-Sanches [dataset for MGA] = available in the software R > library(plspm) > data(satisfaction)

  17. BASIC: Point of Interest (POI) Shopping Centers Dataset I Coverage USA |...

    • datarade.ai
    .csv, .xls
    Updated Feb 26, 2025
    + more versions
    CAP Locations (2025). BASIC: Point of Interest (POI) Shopping Centers Dataset I Coverage USA | Categorized by Center Type | 9 Attributes [Dataset]. https://datarade.ai/data-products/basic-cap-poi-data-shopping-centers-usa-43k-centers-with-cap-locations
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    CAP Locations
    Area covered
    United States
    Description

    CAP USA Shopping Centers Basic is an affordable and efficient solution designed to assist retailers, real estate professionals, and analysts in conducting entry-level assessments of shopping centers across the USA. This resource provides essential data to help classify and evaluate retail properties with ease.

    The dataset includes nine key attributes that aid in identifying and categorizing shopping center types. It features a Unique Property ID for each location, ensuring precise identification and seamless data integration. Additionally, users can quickly determine the size of shopping centers, making it easier to compare properties and assess market opportunities.

    With its cost-effective approach, CAP USA Shopping Centers Basic offers a streamlined yet insightful way to support first-level analysis, helping businesses and investors make informed decisions efficiently.

  18. ‘The Bronson Files, Dataset 4, Field 105, 2013’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 1, 2013
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2013). ‘The Bronson Files, Dataset 4, Field 105, 2013’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-the-bronson-files-dataset-4-field-105-2013-7c96/e98343bf/?iid=003-110&v=presentation
    Explore at:
    Dataset updated
    Aug 1, 2013
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘The Bronson Files, Dataset 4, Field 105, 2013’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/392f69f2-aa43-4e90-970d-33c36e011c19 on 11 February 2022.

    --- Dataset description provided by original source is as follows ---

    Dr. Kevin Bronson provides this unique agricultural research dataset on nitrogen and water management in wheat for computational use. Ten irrigation treatments from a linear sprinkler were combined with nitrogen treatments. This dataset includes notation of field events and operations, an intermediate analysis mega-table of correlated and calculated parameters, including laboratory analysis results generated during the experimentation, plus high resolution plot level intermediate data tables of SAS process output, as well as the complete raw sensors records and logger outputs.

    This data was collected during the beginning time period of our USDA Maricopa terrestrial proximal high-throughput plant phenotyping tri-metric method generation, where a 5Hz crop canopy height, temperature, and spectral signature are recorded together to indicate plant health status. In this early development period, our Proximal Sensing Cart Mark1 (PSCM1) platform replaced people carrying the CropCircle (CC) sensors, with improved viewing and mechanical performance as a result.

    Experimental design and operational details of research conducted are contained in related published articles, however further description of the measured data signals as well as germane commentary is herein offered.

    The primary component of this dataset is the Holland Scientific (HS) CropCircle ACS-470 reflectance numbers, which, as derived here, consist of raw active optical band-pass values digitized onboard the sensor product. Data is delivered as sequential serialized text output including the associated GPS information. Typically this is a production agriculture support technology, enabling efficient precision application of nitrogen fertilizer. We used this optical reflectance sensor technology to investigate plant agronomic biology, as the ACS-470 is a unique product: not only rugged and reliable, but also illumination-active and filter-customizable.

    Individualized ACS-470 sensor detector behavior and subsequent index calculation influence can be understood through analysis of white-panel and other known target measurements. When a sensor is held 120cm from a titanium dioxide white painted panel, a normalized unity value of 1.0 is set for each detector. To generate this dataset we used a Holland Scientific SC-1 device and set the 1.0 unity value (field normalize) on each sensor individually, before each data collection, and without using any channel gain boost. The SC-1 field normalization device allows a communications connection to a Windows machine, where company provided sensor control software enables the necessary sensor normalization routine, and a real-time view of streaming sensor data.

    This type of active proximal multi-spectral reflectance data may be perceived as inherently “noisy”; however basic analytical description consistently resolves a biological patterning, and more advanced statistical analysis is suggested to achieve discovery. Sources of polychromatic reflectance are inherent in the environment, and can be influenced by surface features like wax or water, or the presence of crystal mineralization; varying bi-directional reflectance in the proximal space is a model reality, and directed energy emission reflection sampling is expected to support physical understanding of the underlying passive environmental system.

    Soil in view of the sensor does decrease the raw detection amplitude of the target color returned and can add a soil reflection signal component. Yet that return accurately represents a largely two-dimensional cover and intensity signal of the target material present within each view. It does not, however, represent a reflection of the plant material alone, because the view can contain additional features. Expect NDVI values greater than 0.1 when sensing plants, saturating around 0.8 rather than the typical 0.9 of passive NDVI.

    The active signal does not transmit enough energy to penetrate past perhaps LAI 2.1 or less, compared to what a solar-induced passive reflectance sensor would encounter. However, the focus of our active sensor scan is on the uppermost expanded canopy leaves, which are positioned to intercept the major solar energy. Active energy sensors are easier to direct, and in our capture method we target a consistent sensor height of 1m above the average canopy height, maintain a rig travel speed target around 1.5 mph, and keep the sensors parallel to the ground in a nadir view.

    We consider these CropCircle raw detector returns to be more “instant” in generation, and “less-filtered” electronically, while onboard the “black-box” device, than are other reflectance products which produce vegetation indices as averages of multiple detector samples in time.

    It is known through internal sensor performance tracking across our entire location inventory, that sensor body temperature change affects sensor raw detector returns in minor and undescribed yet apparently consistent ways.

    Holland Scientific 5Hz CropCircle active optical reflectance ACS-470 sensors, that were measured on the GeoScout digital propriety serial data logger, have a stable output format as defined by firmware version.

    Different numbers of csv data files were generated based on field operations, and there were a few short duration instances where GPS signal was lost, multiple raw data files when present, including white panel measurements before or after field collections, were combined into one file, with the inclusion of the null value placeholder -9999. Two CropCircle sensors, numbered 2 and 3, were used supplying data in a lined format, where variables are repeated for each sensor, creating a discrete data row for each individual sensor measurement instance.

    We offer six high-throughput single-pixel spectral colors, recorded at 530, 590, 670, 730, 780, and 800nm. The filtered band-pass was 10nm, except for the NIR, which was set to 20nm and supplied an increased signal (including increased noise).

    Dual, or tandem, CropCircle sensor paired usage empowers additional vegetation index calculations such as:
    DATT = (r800-r730)/(r800-r670)
    DATTA = (r800-r730)/(r800-r590)
    MTCI = (r800-r730)/(r730-r670)
    CIRE = (r800/r730)-1
    CI = (r800/r590)-1
    CCCI = NDRE/NDVIR800
    PRI = (r590-r530)/(r590+r530)
    CI800 = ((r800/r590)-1)
    CI780 = ((r780/r590)-1)

    The Campbell Scientific (CS) environmental data recording of small range (0 to 5 v) voltage sensor signals are accurate and largely shielded from electronic thermal induced influence, or other such factors by design. They were used as was descriptively recommended by the company. A high precision clock timing, and a recorded confluence of custom metrics, allow the Campbell Scientific raw data signal acquisitions a high research value generally, and have delivered baseline metrics in our plant phenotyping program. Raw electrical sensor signal captures were recorded at the maximum digital resolution, and could be re-processed in whole, while the subsequent onboard calculated metrics were often data typed at a lower memory precision and served our research analysis.

    Improved Campbell Scientific data at 5Hz is presented for nine collection events, where thermal, ultrasonic displacement, and additional GPS metrics were recorded. Ultrasonic height metrics generated by the Honeywell sensor and present in this dataset represent successful phenotypic recordings. The Honeywell ultrasonic displacement sensor has worked well in this application because of its 180Khz signal frequency that ranges 2m of space. Air temperature is still a developing metric; a thermocouple wire junction (TC) placed in free air with a solar shade produced a low-confidence passive ambient air temperature.

    Campbell Scientific logger derived data output is structured in a column format, with multiple sensor data values present in each data row. One data row represents one program output cycle recording across the sensing array, as there was no onboard logger data averaging or down sampling. Campbell Scientific data is first recorded in binary format onboard the data logger, and then upon data retrieval, converted to ASCII text via the PC based LoggerNet CardConvert application. Here, our full CS raw data output, that includes a four-line header structure, was truncated to a typical single row header of variable names. The -9999 placeholder value was inserted for null instances.

    There is canopy thermal data from three view vantages. A nadir sensor view, and looking forward and backward down the plant row at a 30 degree angle off nadir. The high confidence Apogee Instruments SI-111 type infrared radiometer, non-contact thermometer, serial number 1052 was in a front position looking forward away from the platform, number 1023 with a nadir view was in middle position, and sensor number 1022 was in a rear position and looking back toward the platform frame, until after 4/10/2013 when the order was reversed. We have a long and successful history testing and benchmarking performance, and deploying Apogee Instruments infrared radiometers in field experimentation. They are biologically spectral window relevant sensors and return a fast update 0.2C accurate average surface temperature, derived from what is (geometrically weighted) in their field of view.

    Data gaps do exist beyond the -9999 null value designations: there are some instances when the GPS signal was lost or, rarely, when the HS GeoScout logger erred. GPS information may be missing at the start of data recording.

  19. KU-HAR: An Open Dataset for Human Activity Recognition

    • data.mendeley.com
    Updated Feb 16, 2021
    + more versions
    Abdullah-Al Nahid (2021). KU-HAR: An Open Dataset for Human Activity Recognition [Dataset]. http://doi.org/10.17632/45f952y38r.5
    Explore at:
    Dataset updated
    Feb 16, 2021
    Authors
    Abdullah-Al Nahid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (Always use the latest version of the dataset.)

    Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them. The activities are:

    • Stand ➞ Standing still (1 min)
    • Sit ➞ Sitting still (1 min)
    • Talk-sit ➞ Talking with hand movements while sitting (1 min)
    • Talk-stand ➞ Talking with hand movements while standing or walking (1 min)
    • Stand-sit ➞ Repeatedly standing up and sitting down (5 times)
    • Lay ➞ Laying still (1 min)
    • Lay-stand ➞ Repeatedly standing up and laying down (5 times)
    • Pick ➞ Picking up an object from the floor (10 times)
    • Jump ➞ Jumping repeatedly (10 times)
    • Push-up ➞ Performing full push-ups (5 times)
    • Sit-up ➞ Performing sit-ups (5 times)
    • Walk ➞ Walking 20 meters (≈12 s)
    • Walk-backward ➞ Walking backward for 20 meters (≈20 s)
    • Walk-circle ➞ Walking along a circular path (≈20 s)
    • Run ➞ Running 20 meters (≈7 s)
    • Stair-up ➞ Ascending a set of stairs (≈1 min)
    • Stair-down ➞ Descending a set of stairs (≈50 s)
    • Table-tennis ➞ Playing table tennis (1 min)

    Contents of the attached .zip files are:

    1. Raw_time_domian_data.zip ➞ Originally collected 1945 time-domain samples in separate .csv files. The arrangement of information in each .csv file is:
      • Col. 1, 5 ➞ exact time (elapsed since the start) when the Accelerometer & Gyro output was recorded (in ms)
      • Col. 2, 3, 4 ➞ acceleration along the X, Y, Z axes (in m/s^2)
      • Col. 6, 7, 8 ➞ rate of rotation around the X, Y, Z axes (in rad/s)

    2. Trimmed_interpolated_raw_data.zip ➞ Unnecessary parts of the samples were trimmed (only from the beginning and the end). The samples were interpolated to keep a constant sampling rate of 100 Hz. The arrangement of information is the same as above.

    3. Time_domain_subsamples.zip ➞ 20750 subsamples extracted from the 1945 collected samples, provided in a single .csv file. Each subsample contains 3 seconds of non-overlapping data of the corresponding activity. Arrangement of information:
      • Col. 1–300, 301–600, 601–900 ➞ Accelerometer X, Y, Z axis readings
      • Col. 901–1200, 1201–1500, 1501–1800 ➞ Gyro X, Y, Z axis readings
      • Col. 1801 ➞ class ID (0 to 17, in the order mentioned above)
      • Col. 1802 ➞ length of each channel's data in the subsample
      • Col. 1803 ➞ serial no. of the subsample
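    The subsample column layout described above can be sketched as a simple splitter. This is an illustrative Python sketch based only on the stated layout (the function name and structure are my own, not from the dataset's tooling):

```python
# Split one 1803-column subsample row (as described for
# Time_domain_subsamples.zip) into channels and metadata fields.
def split_subsample(row):
    acc = [row[0:300], row[300:600], row[600:900]]          # Acc. X, Y, Z
    gyro = [row[900:1200], row[1200:1500], row[1500:1800]]  # Gyro X, Y, Z
    class_id = int(row[1800])   # activity class, 0 to 17
    length = int(row[1801])     # length of each channel's data
    serial_no = int(row[1802])  # serial number of the subsample
    return acc, gyro, class_id, length, serial_no

# Dummy row standing in for one parsed .csv line
acc, gyro, class_id, length, serial_no = split_subsample(list(range(1803)))
```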

    Gravity acceleration was omitted from the Acc.meter data, and no filter was applied to remove noise. The dataset is free to download, modify, and use.

    More information is provided in the data paper which is currently under review: N. Sikder, A.-A. Nahid, KU-HAR: An open dataset for heterogeneous human activity recognition, Pattern Recognit. Lett. (submitted).

    A preprint will be available soon.

    Backup: drive.google.com/drive/folders/1yrG8pwq3XMlyEGYMnM-8xnrd6js0oXA7

  20. Data from: Red wine DataSet

    • kaggle.com
    Updated Aug 21, 2023
    Cite
    Suraj_kumar_Gupta (2023). Red wine DataSet [Dataset]. https://www.kaggle.com/datasets/soorajgupta7/red-wine-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Suraj_kumar_Gupta
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    Datasets Description:

    The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.

    Classification and Regression Tasks: These datasets are suitable for both classification and regression analyses. The classes are ordered but imbalanced: there are many more normal wines than excellent or poor ones.

    Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, are:

    1. Fixed acidity
    2. Volatile acidity
    3. Citric acid
    4. Residual sugar
    5. Chlorides
    6. Free sulfur dioxide
    7. Total sulfur dioxide
    8. Density
    9. pH
    10. Sulphates
    11. Alcohol

    The output variable, based on sensory data, is denoted by: 12. Quality (score ranging from 0 to 10)

    Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
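    The thresholding tip above is a one-liner in pandas. A minimal sketch, using a stand-in frame; the commented read assumes the UCI winequality-red.csv layout (semicolon-separated, with a "quality" column), which should be checked against your copy:

```python
# Sketch: recode the 0-10 quality score into a binary target,
# 1 for quality >= 7 ("good") and 0 otherwise ("not good").
import pandas as pd

df = pd.DataFrame({                                   # stand-in for the real CSV
    "alcohol": [9.4, 9.8, 10.5, 9.2, 11.0, 9.9, 10.1],
    "quality": [5, 6, 7, 4, 8, 6, 7],
})
# df = pd.read_csv("winequality-red.csv", sep=";")    # UCI files are ';'-separated

df["good"] = (df["quality"] >= 7).astype(int)

# Drop the original 10-point column before modeling to avoid data leakage:
feature_cols = [c for c in df.columns if c not in ("quality", "good")]
```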

    Operational Workflow: To efficiently utilize the dataset in KNIME, the following steps are recommended:

    1. Connect a File Reader node (for the CSV) to a Linear Correlation node and an interactive histogram node for basic Exploratory Data Analysis (EDA).
    2. Connect the File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable: 'good wine' versus the rest.
    3. Connect the Rule Engine node output to a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
    4. Connect the Column Filter node output to a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
    5. Feed the Partitioning node's training split into a Decision Tree Learner node.
    6. Connect the Partitioning node's test split to a Decision Tree Predictor node.
    7. Connect the Decision Tree Learner node's model output to the Decision Tree Predictor node.
    8. Finally, connect the Decision Tree Predictor output to a ROC Curve node to evaluate the model by its AUC value.
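    For readers who prefer code to a GUI, the same workflow can be sketched with scikit-learn. This is an equivalent pipeline, not the KNIME nodes themselves; synthetic features stand in for the 11 physicochemical inputs:

```python
# Sketch: stratified 75/25 split, decision tree, and AUC from the ROC,
# mirroring steps 4-8 of the KNIME workflow above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 11))                    # 11 inputs, like the wine data
y = (X[:, 10] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic "good wine" label

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)   # standard 75%/25% split

tree = DecisionTreeClassifier(max_depth=4, random_state=0)  # max_depth is tunable
tree.fit(X_tr, y_tr)

# AUC is computed from predicted probabilities of the positive class
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
```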

    Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.


Human Resources Data Set

Dataset used for learning data visualization and basic regression

3 scholarly articles cite this dataset (View in Google Scholar)
Description


Note that this dataset isn't perfect; some issues are present by design. It is primarily a teaching data set, built to show human resources professionals how to work with data and analytics.

Content

We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.

Recent additions to the data include:

  • Absences
  • Most Recent Performance Review Date
  • Employee Engagement Score

Acknowledgements

Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over 200 Human Resource Management students at the college. Students in the course learn data visualization techniques with Tableau Desktop and use this data set to complete a series of assignments.

Inspiration

We've included some open-ended questions that you can explore and try to address through creating Tableau visualizations, or R or Python analyses. Good luck and enjoy the learning!

  • Is there any relationship between who a person works for and their performance score?
  • What is the overall diversity profile of the organization?
  • What are our best recruiting sources if we want to ensure a diverse organization?
  • Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
  • Are there areas of the company where pay is not equitable?
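A starting point for the first question above, sketched in pandas. The column names (ManagerName, PerformanceScore) and file name follow the v14 codebook but should be verified against your copy; the rows here are made up:

```python
# Sketch: share of each performance rating within each manager's team.
import pandas as pd

hr = pd.DataFrame({                        # stand-in rows, one per employee
    "ManagerName": ["Webster", "Webster", "Zamora", "Zamora", "Zamora"],
    "PerformanceScore": ["Fully Meets", "Exceeds", "Fully Meets",
                         "Needs Improvement", "Fully Meets"],
})
# hr = pd.read_csv("HRDataset_v14.csv")    # the real file

by_manager = (hr.groupby("ManagerName")["PerformanceScore"]
                .value_counts(normalize=True)   # rating shares per manager
                .unstack(fill_value=0))         # managers x ratings table
```

From there, a chi-squared test or a simple bar chart per manager would make the comparison concrete.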

There are so many other interesting questions that could be addressed through this interesting data set. Dr. Patalano and I look forward to seeing what we can come up with.

If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn: http://www.linkedin.com/in/RichHuebner

You can also reach me via email at: Richard.Huebner@go.cambridgecollege.edu
