7 datasets found
  1. Assessing Computational Notebook Understandability through Code Metrics Analysis

    • zenodo.org
    zip
    Updated Oct 12, 2023
    Cite
    Mojtaba Mostafavi (2023). Assessing Computational Notebook Understandability through Code Metrics Analysis [Dataset]. http://doi.org/10.5281/zenodo.8435192
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the comprehension of notebook code and to identify the notebook metrics that play a significant role in its comprehension. The level of code comprehension is a qualitative variable closely associated with the user's opinion about the code. Previous studies have typically employed two approaches to measure it. One approach involves using limited questionnaire methods to review a small number of code pieces. The other relies solely on metadata, such as the number of likes and user votes for a project in the software repository. In our approach, we enhanced the measurement of the understandability level of notebook code by leveraging user comments within a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies and their relevant metadata. To identify user comments associated with code comprehension within the notebooks, we utilized a fine-tuned DistilBERT transformer. We established a user-comment-based criterion for measuring code understandability by considering the number of code-understandability-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes received by the notebook. This criterion has proven to be more effective than alternative methods, making it the ground truth for evaluating the code comprehension of our notebook set. In addition, we collected a total of 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics. These metrics were utilized as features in our dataset. Using the Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.
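
    A rough sketch of how the released metrics could feed the Random Forest step described above might look as follows; the file name notebook_metrics.csv and the label column understandability are illustrative assumptions, not names taken from the archive.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; check the actual archive contents after downloading.
    df = pd.read_csv("notebook_metrics.csv")
    X = df.drop(columns=["understandability"])   # the 34 script- and notebook-based metrics
    y = df["understandability"]                   # label derived from the comment-based criterion

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))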

  2. House Prices: Advanced Regression 'solution' file

    • kaggle.com
    Updated Sep 11, 2020
    Cite
    Carl McBride Ellis (2020). House Prices: Advanced Regression 'solution' file [Dataset]. https://www.kaggle.com/carlmcbrideellis/house-prices-advanced-regression-solution-file/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication by Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, people can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, etc., all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the data in this file.
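
    A minimal sketch of that evaluation, assuming the usual competition column names ("Id", "SalePrice") and a hypothetical local prediction file my_submission.csv:

    import numpy as np
    import pandas as pd

    # solution.csv comes from this dataset; my_submission.csv is your own (hypothetical) prediction file.
    # Column names follow the competition convention ("Id", "SalePrice"); verify them in the files.
    solution = pd.read_csv("solution.csv")
    submission = pd.read_csv("my_submission.csv")

    merged = submission.merge(solution, on="Id", suffixes=("_pred", "_true"))
    rmse = np.sqrt(np.mean((np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2))
    print(f"log RMSE: {rmse:.5f}")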

    Content

    This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.

    Acknowledgements

  3. Bland-Altman Analysis

    • kaggle.com
    Updated Oct 6, 2020
    Cite
    Marília Prata (2020). Bland-Altman Analysis [Dataset]. https://www.kaggle.com/mpwolke/cusersmarildownloadsaltmancsv/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Kaggle
    Authors
    Marília Prata
    Description

    Context

    In 1983 Altman and Bland (B&A) proposed an alternative analysis, based on the quantification of the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement. https://www.kaggle.com/mpwolke/snip-test-bland-altman-analysis/notebook

    Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.

    Content

    The B&A plot analysis is a simple way to evaluate a bias between the mean differences, and to estimate an agreement interval, within which 95% of the differences of the second method, compared to the first one, fall. Data can be analyzed both as unit differences plot and as percentage differences plot. The B&A plot method only defines the intervals of agreements, it does not say whether those limits are acceptable or not. Acceptable limits must be defined a priori, based on clinical necessity, biological considerations or other goals (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/).
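
    A minimal sketch of such an analysis, using placeholder numbers rather than the SNIP measurements:

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative placeholder measurements from two methods (not the SNIP data).
    method_a = np.array([100.0, 102.0, 98.0, 105.0, 97.0, 103.0, 99.0, 101.0])
    method_b = np.array([ 98.0, 104.0, 99.0, 102.0, 95.0, 101.0, 100.0, 99.0])

    mean = (method_a + method_b) / 2          # x-axis: mean of the two methods
    diff = method_a - method_b                # y-axis: difference between methods
    bias = diff.mean()                        # mean difference (bias)
    sd = diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd   # 95% limits of agreement

    plt.scatter(mean, diff)
    plt.axhline(bias, linestyle="--", label="bias")
    plt.axhline(lower, linestyle=":", label="lower LoA")
    plt.axhline(upper, linestyle=":", label="upper LoA")
    plt.xlabel("Mean of method A and method B")
    plt.ylabel("Difference (A - B)")
    plt.title("Bland-Altman plot")
    plt.legend()
    plt.show()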

    Acknowledgements

    Altman and Bland (B&A)

    Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.

    https://www.kaggle.com/anaalinemarcelino/reiability-of-snip-test-in-healthy-children/metadata

    Photo by Chromatograph on Unsplash

    Inspiration

    Bland Altman analysis.

  4. Fine-tuned Llama2 for financial sentiment analysis

    • kaggle.com
    Updated Mar 23, 2024
    Cite
    Luca Massaron (2024). Fine-tuned Llama2 for financial sentiment analysis [Dataset]. https://www.kaggle.com/datasets/lucamassaron/fine-tuned-llama2-for-financial-sentiment-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luca Massaron
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Based on the notebook https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis, this dataset contains the fine-tuned Llama 2 model trained on the annotated dataset of approximately 5,000 sentences from the Aalto University School of Business (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, "Good debt or bad debt: Detecting semantic orientations in economic texts," Journal of the Association for Information Science and Technology, 65[4], 782–796; https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The annotators involved (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.
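
    A hedged sketch of loading the saved weights for inference, assuming they are stored in Hugging Face transformers format; the local path and the prompt template below are assumptions, so consult the linked notebook for the exact setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical Kaggle input path; adjust to wherever the dataset is mounted, and match the
    # prompt template to the one used in the original fine-tuning notebook.
    model_dir = "/kaggle/input/fine-tuned-llama2-for-financial-sentiment-analysis"

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto")

    prompt = ("Analyze the sentiment of the news headline enclosed in square brackets, "
              "and determine if it is positive, neutral, or negative.\n"
              "[Operating profit rose compared with the previous year.] = ")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))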

  5. U.S. Tobacco Use Data

    • kaggle.com
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). U.S. Tobacco Use Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/u-s-tobacco-use-data-1995-2010
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    U.S. Tobacco Use Data

    Prevalence and Trends by State

    By Health [source]

    About this dataset

    This dataset provides insight into the prevalence and trends of tobacco use across the United States. By breaking down this data by state, you can see how tobacco has been used and how usage has changed over time. Smoking is a major contributor to premature deaths and health complications, so understanding historic usage rates can help us analyze and hopefully reduce those negative impacts. Drawing from the Behavioral Risk Factor Surveillance System, this dataset gives us an unparalleled look at both current and historical smoking habits in each of our states. With this data, we can identify high-risk areas and track changes throughout the years for better health outcomes overall.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains information on the prevalence and trends of tobacco use in the United States. The data is broken down by state, and includes percentages of smokers, former smokers, and those who have never smoked. With this dataset you can explore how smoking habits have changed over time as well as what regions of the country have seen more or less consistent smoking trends.

    To begin using this dataset, you will first want to familiarize yourself with the columns included within it and their associated values. There is a “State” column that indicates the US state to which each row refers; there are also columns detailing the percentages of those who smoke every day (Smoke Everyday), smoke some days (Smoke Some Days), previously smoked (Former Smoker) and have never smoked (Never Smoked). The “Location 1” column indicates the geographic region each state falls into, based on either the four US census divisions or eight regions defined by where the states lie in relation to one another.

    Once you understand the data presented within these columns, there are a few different ways to begin exploring how tobacco use has changed over time: plotting prevalence data over different periods such as decades or specific years; compiling descriptive statistics such as percentiles or mean values; contrasting states based on relevant factors such as urban/rural population size or economic/political standing; and looking at patterns that develop across multiple years via visualisations such as box-and-whisker plots.
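
    As a hedged example of the first of those approaches (plotting prevalence over time), assuming the CSV carries a “Year” column alongside the columns described above and that the prevalence columns are numeric percentages:

    import matplotlib.pyplot as plt
    import pandas as pd

    # File name taken from the "Columns" section below; the "Year" column and the exact
    # spelling of the prevalence columns are assumptions to verify against the CSV header.
    csv = "BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv"
    df = pd.read_csv(csv)

    daily = df.pivot_table(index="Year", columns="State", values="Smoke Everyday")
    daily[["Illinois", "California", "Kentucky"]].plot(marker="o")   # illustrative state picks
    plt.ylabel("% smoking every day")
    plt.title("Daily smoking prevalence, 1995-2010")
    plt.show()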

    This wide set of possibilities makes the dataset interesting whether you are looking at regional differences at single points in time or at long-term changes tied to national strategies for reducing nicotine consumption. With all its nuances uncovered, hopefully your results can lead to further research into whatever aspect of smoking culture you find fascinating!

    Research Ideas

    • Comparing regional and state-level smoking rates and trends over time.
    • Analyzing how different demographics are affected by state-level smoking trends, such as comparing gender or age-based differences in prevalence and/or decreasing or increasing rates of tobacco use at the regional level over time.
    • Developing visualization maps that show changes in tobacco consumption prevalence (and related health risk factors) by location, on an interactive website or tool, for public consumption of data insights from this dataset.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Open Database License (ODbL) v1.0

    • You are free to: Share (copy and redistribute the material in any medium or format) and Adapt (remix, transform, and build upon the material for any purpose, even commercially).
    • You must: Give appropriate credit (provide a link to the license and indicate if changes were made), ShareAlike (distribute your contributions under the same license as the original), and keep intact all notices that refer to this license, including copyright notices.
    • No additional restrictions: you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

    Columns

    File: BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv | Column name | ...

  6. AI-Based Job Site Matching

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). AI-Based Job Site Matching [Dataset]. https://www.kaggle.com/datasets/thedevastator/ai-based-job-site-matching/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AI-Based Job Site Matching

    Leveraging 400k+ Hours of Resource & Performance Data

    By [source]

    About this dataset

    As you savvy job-seekers know, selecting an optimal site for GlideinWMS jobs is no small feat: weighing so many critical variables and performing the highly sophisticated calculations needed to maximize the gains can be a tall order. Our dataset offers a valuable helping hand: with detailed insight into resource metrics and time-series analysis of over 400K hours of data, this treasure trove of information will hasten your journey towards finding just the right spot for all your job needs.

    Specifically, our dataset contains three files: dataset_classification.csv, which provides information on critical elements such as disk usage and CPU cache size; dataset_time_series_analysis.csv, featuring in-depth takeaways from careful time series analysis; and finally dataset_400k_hour.csv, gathering computation results from over 400K hours of testing! With columns such as Failure (indicating whether or not the job failed), TotalCpus (the total number of CPUs used by the job), CpuIsBusy (whether or not the CPU is busy), and SlotType (the type of slot used by the job), it's easier than ever to plot that perfect path to success!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset can be used to help identify the most suitable site for GlideinWMS jobs. It contains resource metrics and time-series analysis, which can provide useful insight into the suitability of each potential site. The dataset consists of three sets: dataset_classification.csv, dataset_time_series_analysis.csv and dataset_400k_hour.csv.

    The first set provides a high-level view of the critical resource metrics that are essential when matching a job and a site: DiskUsage, TotalCpus, TotalMemory, TotalDisk, CpuCacheSize, TotalVirtualMemory and TotalSlots, along with whether the CpuIsBusy and the SlotType for each job at each potential site. There is also a Failure column should an issue arise during this process, and finally a Site column so that users can ensure they are matching jobs to sites within their own specific environment if required by policy or business rules.
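
    A small sketch of a first look at that classification file, using the column names quoted above; the exact types and encodings in the CSV are assumptions, and this is not the authors' own tooling:

    import pandas as pd

    # Column names ("Failure", "Site", "SlotType", "CpuIsBusy", ...) are quoted from the
    # description above; their exact types and encodings in the CSV are assumptions.
    df = pd.read_csv("dataset_classification.csv")

    print(df[["TotalCpus", "TotalMemory", "TotalDisk", "DiskUsage"]].describe())
    print(df.groupby("Site")["Failure"].value_counts(normalize=True))       # failure share per site
    print(df.groupby("SlotType")["CpuIsBusy"].value_counts(normalize=True))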

    The second set provides detailed time-series analysis of these metrics over longer timeframes, together with LastUpdate (indicating when the analysis was generated), ydate (year of last update), mdate (month of last update) and hdate (hour of last update), so that data is refreshed on a regular basis and up-to-the-minute decisions can be made during busy times like peak workloads or reallocations caused by anomalies in usage patterns within existing systems/environments.

    Finally, our third set takes things one step further with detailed information from our 400k+ hours of analytical data collection, allowing you to maximize efficiency while selecting the best possible matches across multiple sites/criteria using only one tool (which we have conveniently packaged together in this Kaggle dataset).

    By taking advantage of our AI-driven approach, you will be able to benefit from optimal job selection across many different scenarios, such as maximum-efficiency setups that boost throughput through real-time scaling, along with better accountability and proper system governance when moving from static systems and static strategies towards ones that react to workload dynamics within new agile deployments, increasing stability while lowering maintenance costs over the longer run!

    Research Ideas

    • Use the total CPU, memory and disk usage metrics to identify jobs that need additional resources to complete quickly and suggest alternative sites with more optimal resource availability
    • Utilize the time-series analysis using failure rate, last update time series, as well as month/hour/year of last update metrics to create predictive models for job site matching and failure avoidance on future jobs
    • Identify inefficiencies in scheduling by cross-examining job types (slot type), CPU caching size requirements against historical data to find opportunities for optimization or new approaches to job organization

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

  7. Chicago Crime

    • kaggle.com
    Updated Sep 18, 2025
    Cite
    Ashkan Ranjbar (2025). Chicago Crime [Dataset]. https://www.kaggle.com/ashkanranjbar/chicago-crime/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ashkan Ranjbar
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    Chicago
    Description

    This dataset has gained popularity over time and is widely known. While Kaggle courses teach how to use Google BigQuery to extract a sample from it, this notebook provides a HOW-TO guide to access the dataset directly within your own notebook. Instead of uploading the entire dataset here, which is quite large, I offer several alternatives to work with a smaller portion of it. My main focus was to demonstrate various techniques to make the dataset more manageable on your own laptop, ensuring smoother operations. Additionally, I've included some interesting insights on basic descriptive statistics and even a modeling example, which can be further explored based on your preferences. I intend to revisit and refine it in the near future to enhance its rigor. Meanwhile, I welcome any suggestions to improve the notebook!
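
    A hedged sketch of that direct-access route inside a notebook; the public table name follows Kaggle's BigQuery courses, and the column names and quota behaviour should be verified before relying on them:

    import pandas as pd
    from google.cloud import bigquery

    # Table name follows Kaggle's BigQuery SQL courses; column names should be checked
    # against the table schema, and the LIMIT keeps the sample small enough for a laptop.
    client = bigquery.Client()
    query = """
        SELECT date, block, iucr, primary_type, description, location_description,
               arrest, domestic, beat, district, ward, community_area, fbi_code,
               latitude, longitude, year
        FROM `bigquery-public-data.chicago_crime.crime`
        WHERE year >= 2020
        LIMIT 100000
    """
    df = client.query(query).to_dataframe()

    # Derive the Month/Day/Hour features mentioned in the column list below.
    df["date"] = pd.to_datetime(df["date"])
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    df["hour"] = df["date"].dt.hour
    print(df.shape)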

    Here are the columns that I have chosen to include (after carefully eliminating a few others):

    • Date: This column represents the timestamp of the incident. From this column, I have extracted the Month, Day, and Hour information. We can also add additional time-based columns such as Week and Day of the Week, among others.
    • Block: This column provides a partially redacted address where the incident occurred, indicating the same block as the actual address.
    • IUCR: The acronym stands for Illinois Uniform Crime Reporting. This code is directly linked to the Primary Type and Description. You can find more information about it in this link.
    • Primary Type: This column describes the primary category of the IUCR code mentioned above.
    • Description: This column provides a secondary description of the IUCR code, serving as a subcategory of the primary description.
    • Location Description: Here, you can find the description of the location where the incident took place.
    • Arrest: This column indicates whether an arrest was made in relation to the incident.
    • Domestic: It shows whether the incident was domestic-related, as defined by the Illinois Domestic Violence Act.
    • Beat: The beat refers to the smallest police geographic area, with each beat having a dedicated territory. You can find more information about it in this link.
    • District: This column represents the police district where the incident occurred.
    • Ward: It refers to the number that labels the City Council district where the incident took place.
    • Community Areas: This column indicates the community area where the incident occurred. Chicago has a total of 77 community areas.
    • FBI Code: The crime classification outlined in the FBI's National Incident-Based Reporting System (NIBRS).
    • X-Coordinate, Y-Coordinate, Latitude, Longitude, Location: These columns provide information about the geographical coordinates of the incident location, including latitude and longitude. The "Location" column contains just the latitude and longitude coordinates.
    • Year, Updated On: These columns represent the year of the incident and the date on which the dataset was last updated.

    Feel free to explore the notebook and provide any suggestions for improvement. Your feedback is highly appreciated!

