10 datasets found
  1. Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Tracey L. Weissgerber; Natasa M. Milic; Stacey J. Winham; Vesna D. Garovic (2023). Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm [Dataset]. http://doi.org/10.1371/journal.pbio.1002128
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tracey L. Weissgerber; Natasa M. Milic; Stacey J. Winham; Vesna D. Garovic
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
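    The core problem, that summary graphics hide the underlying distribution, can be demonstrated in a few lines. The numbers below are invented for illustration, not taken from the study; the three small samples are engineered to share the same mean, so a bar graph of means would look identical for all three:

```python
import numpy as np

# Three small samples invented so that every mean is exactly 5.0;
# a bar graph of means would be identical, yet the distributions differ.
symmetric = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
bimodal = np.array([3.0, 3.2, 5.0, 6.8, 7.0])
outlier = np.array([4.2, 4.3, 4.4, 4.5, 7.6])

for name, sample in [("symmetric", symmetric), ("bimodal", bimodal), ("outlier", outlier)]:
    # Same bar height, very different spread
    print(f"{name:9s} mean={sample.mean():.2f} sd={sample.std(ddof=1):.2f}")
```

    A univariate scatterplot of the raw points, like the ones the paper's Excel templates produce, makes such differences obvious at a glance.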

  2. Law and Order TV Series Dataset

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Cite
    The Devastator (2023). Law and Order TV Series Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/law-and-order-tv-series-dataset
    Explore at:
    Available download formats: zip (1443584 bytes)
    Dataset updated
    Dec 8, 2023
    Authors
    The Devastator
    Description

    Law and Order TV Series Dataset

    Law and Order TV Series Data

    By Gove Allen [source]

    About this dataset

    The Law and Order Dataset is a comprehensive collection of data related to the popular television series Law and Order that aired from 1990 to 2010. This dataset, compiled by IMDB.com, provides detailed information about each episode of the show, including its title, summary, airdate, director, writer, guest stars, and IMDb rating.

    With over 450 episodes spanning 20 seasons of the original series as well as its spin-offs like Law and Order: Special Victims Unit, this dataset offers a wealth of information for analyzing various facets of criminal justice and law enforcement portrayed in the show. Whether you are a student or researcher studying crime-related topics or simply an avid fan interested in exploring behind-the-scenes details about your favorite episodes or actors involved in them, this dataset can be a valuable resource.

    By examining this extensive collection of data using SQL queries or other analytical techniques, one can gain insights into patterns such as common tropes used in different seasons or characters that appeared most frequently throughout the series. Additionally, researchers can investigate correlations between factors like episode directors/writers and their impact on viewer ratings.

    This dataset allows users to dive deep into analyzing aspects like crime types covered within episodes (e.g., homicide cases versus white-collar crimes), how often certain guest stars made appearances (including famous actors who had early roles on the show), or which writers/directors contributed most consistently high-rated episodes. Such analyses provide opportunities for uncovering trends over time within Law and Order's narrative structure while also shedding light on societal issues addressed by the series.

    By making this dataset available for educational purposes at collegiate levels specifically aimed at teaching SQL skills—a powerful tool widely used in data analysis—the intention is to empower students with real-world examples they can explore hands-on while honing their database querying abilities. The graphical representation accompanying this dataset further enhances understanding by providing visualizations that illustrate key relationships between different variables.

    Whether you are a seasoned data analyst, a budding criminologist, or simply looking to understand the intricacies of one of the most successful crime dramas in television history, the Law and Order Dataset offers you a vast array of information ripe for exploration and analysis.

    How to use the dataset

    Understanding the Columns

    Before diving into analyzing the data, it's important to understand what each column represents. Here is an overview:

    • Episode: The episode number within its respective season.
    • Title: The title of each episode.
    • Season: The season number in which each episode belongs.
    • Year: The year in which each episode was released.
    • Rating: IMDB rating for each episode (on a scale from 0-10).
    • Votes: Number of votes received by each episode on IMDB.
    • Description: Brief summary or description of each episode's plot.
    • Director: Director(s) responsible for directing an episode.
    • Writers: Writer(s) credited for writing an episode.
    • Stars: Actor(s) who starred in an individual episode.

    Exploring Episode Data

    The dataset allows you to explore various aspects of individual episodes as well as broader trends throughout different seasons:

    1. Analyzing Ratings:

    - You can examine how ratings vary across seasons using aggregation functions like average (AVG), minimum (MIN), maximum (MAX), etc., depending on your analytical goals.
    - Identify popular episodes by sorting based on highest ratings or most votes received.
    

    2. Trends over Time:

    - Investigate how ratings have changed over time by visualizing them using line charts or bar graphs based on release years or seasons.
    - Examine if there are any significant fluctuations in ratings across different seasons or years.
    

    3. Directors and Writers:

    - Identify episodes directed by a specific director or written by particular writers by filtering the dataset based on their names.
    - Analyze the impact of different directors or writers on episode ratings.
    

    4. Popular Actors:

    - Explore episodes featuring popular actors from the show such as Mariska Hargitay (Olivia Benson), Christopher Meloni (Elliot Stabler), etc.
    - Investigate whether episodes with popular actors received higher ratings compared to ...
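    The rating analyses above can be sketched in pandas as well as SQL. The rows below are a hypothetical slice using the column names listed earlier; the titles match early episodes, but the ratings and vote counts are invented, not the real Kaggle values:

```python
import pandas as pd

# Hypothetical slice of the episode table; the ratings and vote
# counts here are invented for illustration
episodes = pd.DataFrame({
    "Season": [1, 1, 2, 2, 2],
    "Title": ["Prescription for Death", "Subterranean Homeboy Blues",
              "Confession", "The Wages of Love", "Aria"],
    "Rating": [7.9, 7.6, 8.1, 7.7, 8.0],
    "Votes": [512, 430, 389, 297, 305],
})

# 1. Ratings per season: the AVG / MIN / MAX aggregations from the SQL examples
per_season = episodes.groupby("Season")["Rating"].agg(["mean", "min", "max"])

# Popular episodes: sort by rating, break ties with vote count
top = episodes.sort_values(["Rating", "Votes"], ascending=False)
print(per_season)
print(top.iloc[0]["Title"])
```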
    
  3. synthetic_chart

    • huggingface.co
    Cite
    Hau Kieu, synthetic_chart [Dataset]. https://huggingface.co/datasets/YuukiAsuna/synthetic_chart
    Explore at:
    Authors
    Hau Kieu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic Chart Dataset

      Overview
    

    The Synthetic Chart Dataset is a curated collection of 1500 synthetic chart images paired with their structural representations. It supports research on chart understanding, visual reasoning, and graph-based data interpretation. Each example contains:

    • A chart image
    • The chart’s type (e.g., bar, pie, line, etc.)
    • The difficulty level (easy, medium, or hard)
    • A node field describing the elements in the chart
    • An edge field for relationship information of… See the full description on the dataset page: https://huggingface.co/datasets/YuukiAsuna/synthetic_chart.

  4. NFL Players Performance and Salary

    • kaggle.com
    zip
    Updated Dec 4, 2022
    Cite
    The Devastator (2022). NFL Players Performance and Salary [Dataset]. https://www.kaggle.com/datasets/thedevastator/nfl-player-performance-and-salary-insights-2018
    Explore at:
    Available download formats: zip (100140 bytes)
    Dataset updated
    Dec 4, 2022
    Authors
    The Devastator
    Description

    NFL Players Performance and Salary

    Uncover Trends, Make Predictions and Analyze Demographics

    By Ben Jones [source]

    About this dataset

    This Kaggle dataset contains unique and fascinating insights into the 2018-2019 season of the NFL. It provides comprehensive data such as player number, position, height, weight, age, experience level in years, college attended, and the team each player is playing for. All these attributes can be used to expand on research within the NFL community. From uncovering demographics of individual teams to discovering correlations between players' salaries and performance, this dataset has endless possibilities for researchers to dive into. Whether you are searching for predictions about future seasons or creating complex analyses, it will give you a detailed view of the 2018-2019 season like never before! Explore why each team is special, who shone individually that year, and what strategies could have been employed more efficiently with this captivating collection of 2018-2019 NFL Players Stats & Salaries!


    How to use the dataset

    • Get familiar with the characteristics of each column in the data set: Rk, Player, Pos, Tm, Cap Hit, Player #, HT, WT, Age, Exp, College, Team. Understanding these columns is key for further analysis, since each attribute offers unique insights into NFL players' salaries and performance during this season. For example, HT (height) and WT (weight) are useful if you want to study correlations between player body types and their salaries or game performance. Pos (position) is another critical factor, since it largely determines how much a team pays its players for specific roles on the field, such as quarterback or running back.
    • Use visualizations to better understand the statistical data: graphical forms like scatter plots and bar charts are fantastic at revealing correlations, letting you draw conclusions quickly by comparing datasets side by side or juxtaposing attributes to explore trends across different teams. You could also chart all 32 teams by their Cap Hits so that viewers can spot outlier values quickly without scanning a table full of numbers.
    • Employ analytical techniques such as regular expression matching (RegEx) where needed; RegEx detects patterns within text fields, which is exceptionally useful when extracting insights from long strings such as college team name URLs. This can lead toward deeper exploration of why certain franchises may have higher-salaried players than others.
    • Finally, don't forget the mathematical tools at your disposal; basic operations like proportions, ratios, averages, and medians quite often reveal dazzling new facets of the data and uncover interesting connections between entities, such as how height compares across drafted colleges.

    We hope these tips help you unlock the hidden gems in this dataset.
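    As a minimal sketch of the column exploration above, assuming the schema described (Pos, Tm, Cap Hit, Age) and using invented rows rather than the real Kaggle data:

```python
import pandas as pd

# Invented rows; the column names follow the schema described above
players = pd.DataFrame({
    "Player": ["QB One", "QB Two", "RB One", "RB Two"],
    "Pos": ["QB", "QB", "RB", "RB"],
    "Tm": ["DAL", "GNB", "DAL", "NWE"],
    "Cap Hit": [22_500_000, 19_000_000, 3_200_000, 2_800_000],
    "Age": [28, 34, 24, 26],
})

# Average cap hit per position: e.g. quarterbacks vs. running backs
by_pos = players.groupby("Pos")["Cap Hit"].mean()

# Team-level totals, the input for an outlier-spotting chart of all 32 teams
by_team = players.groupby("Tm")["Cap Hit"].sum()
print(by_pos.to_dict(), by_team.to_dict())
```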

    Research Ideas

    • Analyzing the impact of position on salaries: This dataset can be used to compare salaries across different positions and analyze the correlations between players’ performance, experience, and salaries.
    • Predicting future NFL MVP candidates: By analyzing popular statistical categories such as passing yards, touchdowns, interceptions and rushing yards for individual players over several seasons, researchers could use this data to predict future NFL MVPs each season.
    • Exploring team demographics: By looking into individual teams' player statistics such as age, height and weight distribution, researchers can analyze and compare demographic trends across the league or within a single team during any given season

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even co...

  5. Theft Incidents in Mexico by State (2015-2025)

    • figshare.com
    csv
    Updated Sep 21, 2025
    Cite
    Montserrat Mora (2025). Theft Incidents in Mexico by State (2015-2025) [Dataset]. http://doi.org/10.6084/m9.figshare.29606657.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 21, 2025
    Dataset provided by
    figshare
    Authors
    Montserrat Mora
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Mexico
    Description

    This dataset contains monthly records of theft-related crimes reported across the 32 states of Mexico from January 2015 through August 2025. Sourced from the official open data portal of the Executive Secretariat of the National Public Security System (SESNSP), the data categorizes theft by type and by whether the crime involved violence.

    Dataset Fields
    • PERIOD: Reporting period in YYYY-MM-DD format.
    • STATE_ID: Numeric identifier for each Mexican state.
    • STATE: Name of the Mexican state.
    • CRIME: Category of theft (e.g., "Bank robbery", "Motor vehicle theft").
    • CRIME_ES: Original crime category name in Spanish, as provided in the source data.
    • MODALITY: Indicates whether the crime was committed "With violence" or "Without violence".
    • TOTAL_CASES: Number of reported incidents for the specified category, time, and location.

    Supplementary Materials
    This dataset is part of a broader project that includes:
    • A Python script with two main functions, both illustrated with sample charts included in the materials:
      - Function 1 generates normalized bar charts to visualize the proportion of each theft type by modality, configurable by state and year.
      - Function 2 produces a stacked bar chart with an STL-based trend line to show the evolution of a specific crime over time, configurable by state and crime.
    • A requirements.txt file listing Python dependencies for easy environment setup.

    Potential Applications
    Ideal for researchers, data analysts, and policy professionals, this dataset supports the study of crime trends, regional disparities in theft modalities, and the evaluation of public security policies.

    Source: https://www.gob.mx/sesnsp/acciones-y-programas/datos-abiertos-de-incidencia-delictiva
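    A lightweight sketch of the two analyses the supplementary script performs, using invented counts rather than the real SESNSP figures. Note that the actual Function 2 fits an STL decomposition (typically via statsmodels); a centered rolling mean stands in for the trend component here to keep the example dependency-light:

```python
import pandas as pd

# Invented monthly counts standing in for the SESNSP figures
idx = pd.date_range("2015-01-01", periods=24, freq="MS")
cases = pd.Series(range(100, 124), index=idx, name="TOTAL_CASES")

# Trend line: a 12-month centered rolling mean as a stand-in for the
# STL trend component that the project's Function 2 actually fits
trend = cases.rolling(window=12, center=True, min_periods=6).mean()

# Function 1 style: normalize each period's counts by modality to proportions
monthly = pd.DataFrame(
    {"With violence": [30, 40], "Without violence": [70, 60]},
    index=["2015-01", "2015-02"],
)
shares = monthly.div(monthly.sum(axis=1), axis=0)
print(shares.round(2))
```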

  6. Large-Scale Preference Dataset

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Large-Scale Preference Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/large-scale-preference-dataset/discussion
    Explore at:
    Available download formats: zip (207130812 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Large-Scale Preference Dataset

    Training Powerful Reward & Critic Models with Aligned Language Models

    By Huggingface Hub [source]

    About this dataset

    UltraFeedback is an unprecedentedly expansive, meticulously detailed, and multifarious preference dataset built exclusively to train powerful reward and critic models with aligned language models. With thousands of prompts lifted from countless distinct sources like UltraChat, ShareGPT, Evol-Instruct, TruthfulQA and more, UltraFeedback contains an overwhelming 256k samples, perfect for introducing to a wide array of AI-driven projects. The correct and incorrect answers attached to this remarkable collection sit conveniently within the same data file! Get up close in exploring the options presented in UltraFeedback, a groundbreaking new opportunity for data collectors!


    How to use the dataset

    The first step is to understand the content of the dataset, including source, models, correct answers and incorrect answers. Knowing which language models (LM) were used to generate completions can help you better interpret the data in this dataset.

    Once you are familiar with the column titles and their meanings it’s time to begin exploring! To maximize your insight into this data set use a variety of visualization techniques such as scatter plots or bar charts to view sample distributions across different LMs or answer types. Analyzing trends between incorrect and correct answers through data manipulation techniques such as merging sets can also provide valuable insights into preferences across different prompts and sources.

    Finally, you may want to try running LR or other machine learning models on this dataset in order to create simple models for predicting preferences when given inputs from real world scenarios related to specific tasks that require nuanced understanding of instructions provided by one’s peers or superiors.

    The possibilities for further exploration of this dataset are endless - now let’s get started!
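    A minimal pandas sketch of the exploration described above, using a hypothetical slice with the column names from the dataset's schema (the model names and answers are invented):

```python
import pandas as pd

# Hypothetical slice; column names follow the dataset schema,
# the model names and answers are invented
train = pd.DataFrame({
    "source": ["UltraChat", "ShareGPT", "UltraChat", "TruthfulQA"],
    "instruction": ["i1", "i2", "i3", "i4"],
    "models": ["llama-7b", "gpt-3.5", "llama-7b", "alpaca-7b"],
    "correct_answers": ["a", "b", "c", "d"],
    "incorrect_answers": ["w", "x", "y", "z"],
})

# Sample distribution across sources, the bar-chart input suggested above
per_source = train["source"].value_counts()

# How many completions each language model contributed
per_model = train.groupby("models")["instruction"].count()
print(per_source.to_dict(), per_model.to_dict())
```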

    Research Ideas

    • Training sentence completion models on the dataset to generate responses with high accuracy and diversity.
    • Creating natural language understanding (NLU) tasks such as question-answering and sentiment analysis using the aligned dataset as training/testing sets.
    • Developing strongly supervised learning algorithms that use techniques like reward optimization, with potential applications in building machine translation systems from scratch or in upstream text-generation tasks like summarization and dialog generation.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:----------------------|:----------------------------------------------------------------|
    | source | The source of the data. (String) |
    | instruction | The instruction given to the language models. (String) |
    | models | The language models used to generate the completions. (String) |
    | correct_answers | The correct answers to the instruction. (String) |
    | incorrect_answers | The incorrect answers to the instruction. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  7. Men's Mile Run World Record Progression History

    • kaggle.com
    zip
    Updated Jan 14, 2023
    Cite
    The Devastator (2023). Men's Mile Run World Record Progression History [Dataset]. https://www.kaggle.com/datasets/thedevastator/men-s-mile-run-world-record-progression-history
    Explore at:
    Available download formats: zip (3258 bytes)
    Dataset updated
    Jan 14, 2023
    Authors
    The Devastator
    Description

    Men's Mile Run World Record Progression History (1861-Present)

    Examining the Athlete, Nationality and Venue Influence on Race Times

    By Ben Jones [source]

    About this dataset

    This remarkable dataset chronicles the world record progression of the men's mile run, containing detailed information on each athlete's time, their name, nationality, date of their accomplishment and the location of their event. It allows us to look back in history and get a comprehensive overview of how this track event has progressed over time. Analyzing this information can help us understand how training and technology have improved the event over the years; as well as study different athletes' performances and learn how some athletes have pushed beyond their limits or fell short. This valuable resource is an essential source for anyone intrigued by the cutting edge achievements in men's mile running world records. Discovering powerful insights from this dataset can allow us to gain perspective into not only our own personal goals but also uncover ideas on how we could continue pushing our physical boundaries by watching past successes. Explore and comprehend for yourself what it means to be a true athlete at heart!


    How to use the dataset

    This guide provides an introduction on how best to use this dataset in order to analyze various aspects involving the men’s mile run world records. We will focus on analyzing specific fields such as date, athlete name & nationality, time taken for completion and auto status by using statistical methods and graphical displays of data.

    In order to use this data effectively it is important that you understand what each field measures:
    • Time: The amount of time it took for an athlete to finish a race, measured in minutes and seconds (example: 3:54).
    • Auto: Whether or not a pacemaker was used during a specific race (example: yes/no).
    • Athlete Name & Nationality: The name and nationality of the athlete who set the record (example: Usain Bolt, Jamaica).
    • Date: Year in which a specific record was set (example: 2021).
    • Venue: Location at which the record was set (example: London Olympic Stadium).

    Now that you understand what each field measures, let's discuss ways to use these features. Analyzing trends in historical sporting performances has long been a means of understanding changes brought about by new training methods, technologies, and so on. With this dataset, that can be done using basic statistical displays like bar graphs and average analysis, or more advanced methods such as regression analysis or even Bayesian approaches. The first thing anyone dealing with this sort of data should do is inspect it for wacky outliers before beginning more rigorous analysis; if you discover potentially unreasonable values, it is best to discard them before building models or readings based on them (this sort of elimination is common practice). After cleaning your workspace, move on to building interactive visual displays by plotting different columns against one another: plotting time against date lets us see changes over time from 1861 until now. Additionally, plotting time against Auto allows us to see any
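    The regression idea above can be sketched with NumPy. The milestone times below are approximate and included only to illustrate the technique, not as authoritative record data:

```python
import numpy as np

# A few well-known milestones in the mile record (minutes); the times
# are approximate and listed here only to illustrate the regression idea
years = np.array([1954, 1954, 1981, 1985, 1999])
times = np.array([3 + 59.4 / 60, 3 + 58.0 / 60, 3 + 47.33 / 60,
                  3 + 46.32 / 60, 3 + 43.13 / 60])

# Least-squares line: how fast has the record fallen per year?
slope, intercept = np.polyfit(years, times, 1)
print(f"record drops roughly {-slope * 60:.2f} s per year over this span")
```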

    Research Ideas

    • Comparing individual athletes and identifying those who have consistently pushed the event to higher levels of performance.
    • Analyzing national trends related to improvement in track records over time, based on differences in training and technology.
    • Creating a heatmap to visualize the progression of track records around the world and locate regions with a particularly strong historical performance in this event

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. -...

  8. Human resources dataset

    • kaggle.com
    zip
    Updated Mar 15, 2023
    Cite
    Khanh Nguyen (2023). Human resources dataset [Dataset]. https://www.kaggle.com/datasets/khanhtang/human-resources-dataset
    Explore at:
    Available download formats: zip (17041 bytes)
    Dataset updated
    Mar 15, 2023
    Authors
    Khanh Nguyen
    Description
    • The HR dataset is a collection of employee data that includes information on various factors that may impact employee performance. To explore the employee performance factors using Python, we begin by importing the necessary libraries such as Pandas, NumPy, and Matplotlib, then load the HR dataset into a Pandas DataFrame and perform basic data cleaning and preprocessing steps such as handling missing values and checking for duplicates.

    • The analysis also uses various data visualizations to explore the relationships between different variables and employee performance: for example, scatterplots to examine the relationship between job satisfaction and performance ratings, or bar charts to compare average performance ratings across genders or positions.
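    A minimal sketch of the cleaning and exploration steps described above, assuming hypothetical column names (the real Kaggle file's columns may differ):

```python
import pandas as pd

# Mock HR records with hypothetical column names; the real
# Kaggle file's columns may differ
hr = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "job_satisfaction": [3.0, 4.5, 4.5, None, 2.0],
    "performance_rating": [3, 5, 5, 4, 2],
    "gender": ["F", "M", "M", "F", "M"],
})

# Cleaning steps from above: drop the duplicate row, handle missing values
hr = hr.drop_duplicates(subset="employee_id").dropna(subset=["job_satisfaction"])

# Inputs for the scatterplot (satisfaction vs. rating) and the per-gender bar chart
corr = hr["job_satisfaction"].corr(hr["performance_rating"])
by_gender = hr.groupby("gender")["performance_rating"].mean()
print(round(corr, 3), by_gender.to_dict())
```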

  9. International Cricket Players Dataset 🏏

    • kaggle.com
    Updated Dec 29, 2023
    Cite
    Kashish Parmar (2023). International Cricket Players Dataset 🏏 [Dataset]. https://www.kaggle.com/datasets/kashishparmar02/international-cricket-players-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 29, 2023
    Dataset provided by
    Kaggle
    Authors
    Kashish Parmar
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About This Dataset: Explore the dynamic world of international cricket with this comprehensive dataset featuring players from A to Z. Dive into the rich details of each player, including their birthdates, country of origin, and performance statistics in Test, ODI, and T20 formats. Whether you're a cricket enthusiast, analyst, or simply curious about the global cricket landscape, this dataset provides a valuable resource for understanding the diverse profiles of cricket players across different nations. Uncover trends, compare player performances, and gain insights into the fascinating world of cricket through this meticulously curated dataset. 🌐🏏

    Key Features

    Column Name | Description | Example Values
    Name | Player's full name | L F Kline
    Date_Of_Birth | Player's date of birth | 29/09/1934
    Country | Player's country of origin | Australia
    Test | Number of Test matches played | 13
    ODI | Number of ODI matches played (N/A if not played) | N/A
    T20 | Number of T20 matches played (N/A if not played) | N/A

    How to Use This Dataset:

    1. Exploring Player Profiles:

      • Use the "Name" column to identify specific players.
      • Utilize "Date_Of_Birth" for understanding the age of each player.
      • Explore "Country" to analyze the diversity of players from different nations.
    2. Analyzing Performance Statistics:

      • Examine "Test," "ODI," and "T20" columns to understand the number of matches played in each format.
      • Identify trends by comparing statistics across players or countries.
    3. Filtering Data:

      • Use filtering mechanisms to focus on specific subsets of players based on criteria such as country, age, or playing format.
    4. Missing Data Handling:

      • Note that "N/A" in "ODI" and "T20" columns indicates that the player hasn't played matches in these formats.
    5. Visualizations:

      • Create visualizations, such as bar charts or scatter plots, to represent player distributions, age trends, or performance comparisons.
    6. Statistical Analysis:

      • Conduct statistical analyses to identify patterns or correlations between variables.
    7. Contributions and Feedback:

      • If you find interesting insights or have suggestions for improvement, consider contributing your analyses or providing feedback.
    8. Acknowledgments:

      • If you use this dataset in your work, acknowledge the source and any applicable terms of use.
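    Steps 1 through 4 above can be sketched in pandas. Only the L F Kline row reflects the example values from the column table; the other rows are invented:

```python
import pandas as pd

# Only the L F Kline row reflects the example values; the rest is invented
players = pd.DataFrame({
    "Name": ["L F Kline", "Player B", "Player C"],
    "Date_Of_Birth": ["29/09/1934", "15/03/1988", "01/01/1995"],
    "Country": ["Australia", "India", "India"],
    "Test": [13, 40, 5],
    "ODI": ["N/A", "120", "30"],
    "T20": ["N/A", "45", "60"],
})

# Step 4: treat "N/A" as missing so the match counts become numeric
for col in ["ODI", "T20"]:
    players[col] = pd.to_numeric(players[col], errors="coerce")

# Steps 1 and 3: parse birth dates, then filter to one country
players["Date_Of_Birth"] = pd.to_datetime(players["Date_Of_Birth"], format="%d/%m/%Y")
india = players[players["Country"] == "India"]
print(int(india["ODI"].sum()), int(players["T20"].isna().sum()))
```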
  10. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    Available download formats: zip (13105509 bytes)
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis

    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation
    • Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    • Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics
    • Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    • Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy
    • Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    • Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals whether experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation
    • Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)
    • Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations
    • Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    • Salary Discrepancies: Highlight any salary discrepancies between departments or genders that may require further investigation.
    • Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:
    • Pandas: For data manipulation, grouping, and descriptive analysis.
    • NumPy: For numerical operations such as percentiles and correlations.
    • Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
    • Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
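    The describe/percentile/corrcoef/groupby calls from the case study can be run end to end on an invented payroll, a minimal sketch with made-up salaries:

```python
import numpy as np
import pandas as pd

# Invented payroll; the column names follow the case study
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales", "hr"],
    "years_of_experience": [2, 8, 1, 6, 4],
    "salary": [70_000, 120_000, 50_000, 90_000, 60_000],
})

# Step 2: descriptive statistics
summary = df["salary"].describe()

# Step 3: quartiles and the experience/salary correlation
q1, median, q3 = np.percentile(df["salary"], [25, 50, 75])
corr = np.corrcoef(df["years_of_experience"], df["salary"])[0, 1]

# Step 4: average salary per department
dept_mean = df.groupby("department")["salary"].mean()
print(median, round(corr, 3), dept_mean.to_dict())
```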

