License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the comprehension of notebook code and to identify the notebook metrics that play a significant role in it. The level of code comprehension is a qualitative variable closely associated with the user's opinion about the code. Previous studies have typically employed two approaches to measure it. One approach involves using limited questionnaire methods to review a small number of code pieces. The other relies solely on metadata, such as the number of likes and user votes for a project in the software repository. In our approach, we enhanced the measurement of the understandability level of notebook code by leveraging user comments within a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies and their relevant metadata. To identify user comments associated with code comprehension within the notebooks, we utilized a fine-tuned DistilBERT transformer. We established a user-comment-based criterion for measuring code understandability by considering the number of understandability-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes received by the notebook. This criterion has proven to be more effective than alternative methods, making it the ground truth for evaluating the code comprehension of our notebook set. In addition, we collected a total of 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics. These metrics were utilized as features in our dataset. Using a Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.
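As a rough illustration of the comment-classification step described above, here is a minimal sketch; the checkpoint path, label names, and example comments are assumptions rather than the authors' artifacts:

```python
# Minimal sketch (assumed checkpoint path and labels): use a fine-tuned DistilBERT
# classifier to flag Kaggle comments that talk about code comprehension.
from transformers import pipeline

clf = pipeline("text-classification", model="./distilbert-comprehension-comments")

comments = [
    "Great notebook, the markdown explanations made the feature engineering easy to follow.",
    "Congrats on the gold medal!",
]
for comment, pred in zip(comments, clf(comments)):
    print(f"{pred['label']} ({pred['score']:.2f}): {comment}")
```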
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, people can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, and so on, all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the root mean squared error (RMSE) between the logarithm of the predicted value and the logarithm of the data in this file.
This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
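As a starting point, a minimal scoring sketch might look like the following; the file paths and the Id/SalePrice column layout are assumptions based on the competition's usual submission format:

```python
# Minimal sketch: score a candidate submission against the solution file using
# the competition metric (RMSE between the logarithms of predicted and true prices).
# File names and column names (Id, SalePrice) are assumptions.
import numpy as np
import pandas as pd

solution = pd.read_csv("solution.csv")        # the file described above
submission = pd.read_csv("my_submission.csv") # your own predictions

merged = submission.merge(solution, on="Id", suffixes=("_pred", "_true"))
rmse = np.sqrt(
    np.mean((np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2)
)
print(f"log-RMSE: {rmse:.5f}")
```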
In 1983, Altman and Bland (B&A) proposed an alternative analysis based on quantifying the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement. https://www.kaggle.com/mpwolke/snip-test-bland-altman-analysis/notebook
Vanessa Resqueti, Guilherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.
The B&A plot analysis is a simple way to evaluate the bias between the mean differences and to estimate an agreement interval within which 95% of the differences of the second method, compared with the first, fall. Data can be analyzed both as unit-difference plots and as percentage-difference plots. The B&A plot method only defines the intervals of agreement; it does not say whether those limits are acceptable. Acceptable limits must be defined a priori, based on clinical necessity, biological considerations or other goals. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/
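For readers who want to reproduce the plot, here is a minimal matplotlib sketch of a B&A (mean-difference) plot, using synthetic measurement pairs rather than the SNIP data itself:

```python
# Minimal sketch of a Bland-Altman (mean-difference) plot with synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
method_a = rng.normal(100, 10, 50)            # synthetic measurements, method 1
method_b = method_a + rng.normal(2, 5, 50)    # method 2 with a small bias and noise

mean_vals = (method_a + method_b) / 2
diffs = method_b - method_a
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)                # 95% limits of agreement

plt.scatter(mean_vals, diffs)
plt.axhline(bias, color="red", label=f"bias = {bias:.2f}")
plt.axhline(bias + loa, color="gray", linestyle="--", label="bias ± 1.96 SD")
plt.axhline(bias - loa, color="gray", linestyle="--")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (method 2 - method 1)")
plt.legend()
plt.show()
```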
Altman and Bland (B&A)
https://www.kaggle.com/anaalinemarcelino/reiability-of-snip-test-in-healthy-children/metadata
Photo by Chromatograph on Unsplash
Bland Altman analysis.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Based on the notebook https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis, this dataset contains a fine-tuned Llama 2 model trained on an annotated dataset of approximately 5,000 sentences from the Aalto University School of Business (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology, 65(4), 782–796 - https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The annotators (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.
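A hedged sketch of how such a fine-tuned adapter might be loaded for inference with the transformers and peft libraries; the adapter path and prompt wording are assumptions, not the notebook's exact code:

```python
# Minimal sketch (assumed paths/prompt): load a fine-tuned adapter on top of the
# Llama 2 base model and ask it to classify a headline's sentiment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"          # gated base model; requires access
adapter_path = "./llama2-sentiment-adapter"   # hypothetical path to this dataset's weights

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)

prompt = (
    "Analyze the sentiment of the news headline below and answer with positive, negative, or neutral.\n"
    "Headline: Company X reports record quarterly profit.\n"
    "Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```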
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By Health [source]
This dataset provides insight into the prevalence and trends of tobacco use across the United States. By breaking the data down by state, you can see how tobacco use has changed over time. Smoking is a major contributor to premature deaths and health complications, so understanding historic usage rates can help us analyze and hopefully reduce those negative impacts. Drawing from the Behavioral Risk Factor Surveillance System, this dataset gives an unparalleled look at both current and historical smoking habits in each state. With this data, we can identify high-risk areas and track changes through the years for better health outcomes overall.
This dataset contains information on the prevalence and trends of tobacco use in the United States. The data is broken down by state and includes the percentages of people who currently smoke, who formerly smoked, and who have never smoked. With this dataset you can explore how smoking habits have changed over time and which regions of the country have seen more or less consistent smoking trends.
To begin, familiarize yourself with the columns and their values. The “State” column gives the US state each row refers to; further columns give the percentage of people who smoke every day (Smoke Everyday), smoke some days (Smoke Some Days), previously smoked (Former Smoker), or have never smoked (Never Smoked). The “Location 1” column places each state within a broader geographic grouping based on the US census divisions and regions.
Once you understand these columns, there are several ways to explore how tobacco use has changed over time: plotting prevalence over decades or specific years; compiling descriptive statistics such as percentiles or means; contrasting states on relevant factors such as urban/rural population size or economic/political standing; and looking at patterns across multiple years with visualisations such as box-and-whisker plots.
This wide range of possibilities makes the dataset interesting whether you are studying regional differences at a single point in time or long-term changes tied to national strategies for reducing nicotine consumption. Hopefully your results will lead to further research into whichever aspect of smoking culture you find most fascinating!
- Comparing regional and state-level smoking rates and trends over time.
- Analyzing how different demographics are affected by state-level smoking trends, such as comparing gender or age-based differences in prevalence and/or decreasing or increasing rates of tobacco use at the regional level over time.
- Developing visualization maps that show changes in tobacco consumption prevalence (and related health risk factors) by location, e.g. on an interactive website or tool for public consumption of data insights from this dataset.
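As a starting point for the explorations above, here is a minimal pandas sketch for plotting prevalence over time; column names follow the description and may differ slightly (e.g., capitalization, the name of the year column) in the actual CSV:

```python
# Minimal sketch: plot daily-smoking prevalence over time for a few states.
# Column names ("State", "Smoke Everyday") follow the description above; the
# "Year" column name is an assumption and may differ in the file.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(
    "BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv"
)

for state in ["California", "Kentucky", "Utah"]:
    sub = df[df["State"] == state].sort_values("Year")
    plt.plot(sub["Year"], sub["Smoke Everyday"], label=state)

plt.xlabel("Year")
plt.ylabel("Smoke Everyday (%)")
plt.legend()
plt.show()
```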
If you use this dataset in your research, please credit the original authors. Data Source
License: Open Database License (ODbL) v1.0
- You are free to: Share - copy and redistribute the material in any medium or format; Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must: Give appropriate credit - provide a link to the license and indicate if changes were made; ShareAlike - distribute your contributions under the same license as the original; Keep intact all notices that refer to this license, including copyright notices.
- No additional restrictions - you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
File: BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv | Column name | ...
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By [source]
As you savvy job-seekers know, selecting an optimal site for GlideinWMS jobs is no small feat: weighing so many critical variables and performing the sophisticated calculations needed to maximize the gains can be a tall order. Our dataset offers a valuable helping hand: with detailed insight into resource metrics and time-series analysis of over 400K hours of data, this trove of information will hasten your journey towards finding just the right spot for all your job needs.
Specifically, our dataset contains three files: dataset_classification.csv, which provides information on critical elements such as disk usage and CPU cache size; dataset_time_series_analysis.csv, featuring in-depth takeaways from careful time-series analysis; and dataset_400k_hour.csv, gathering computation results from over 400K hours of testing. With columns such as Failure (whether or not the job failed), TotalCpus (the total number of CPUs used by the job), CpuIsBusy (whether or not the CPU is busy), and SlotType (the type of slot used by the job), it's easier than ever to plot that perfect path to success!
This dataset can be used to help identify the most suitable site for GlideinWMS jobs. It contains resource metrics and time-series analysis, which can provide useful insight into the suitability of each potential site. The dataset consists of three sets: dataset_classification.csv, dataset_time_series_analysis.csv and dataset_400k_hour.csv.
The first set gives a high-level view of the critical resource metrics that matter when matching a job to a site: DiskUsage, TotalCpus, TotalMemory, TotalDisk, CpuCacheSize, TotalVirtualMemory and TotalSlots (total slot information), along with CpuIsBusy and the SlotType for each job at each potential site. There is also a Failure column should an issue arise during this process, and a Site column so that users can ensure they are matching jobs to sites within their own environment if required by policy or business rules.
The second set provides detailed time-series analysis of these metrics over longer timeframes, together with LastUpdate (when the analysis was generated) and ydate, mdate and hdate, which give the year, month and hour of the last update. The data is refreshed on a regular basis so that up-to-the-minute decisions can be made during busy times such as peak workloads or reallocations caused by anomalies in usage patterns within existing systems/environments.
Finally, the third set goes one step further with detailed information from our 400K+ hours of collected analytical data, allowing you to maximize efficiency while selecting the best possible matches across multiple sites and criteria using a single tool (conveniently packaged together in this Kaggle dataset).
By taking advantage of our AI-driven approach, you will be able to benefit from optimal job selection across many different scenarios: throughput boosts through real-time scaling, improved accountability and system governance, and greater stability with lower maintenance costs over the long run as you move from static systems and static strategies towards more reactive, dynamic utilization within new agile deployments.
- Use the total CPU, memory and disk usage metrics to identify jobs that need additional resources to complete quickly, and suggest alternative sites with more optimal resource availability.
- Utilize the time-series analysis (failure rate, last-update time series, and the month/hour/year of the last update) to create predictive models for job-site matching and failure avoidance on future jobs; see the sketch after this list.
- Identify inefficiencies in scheduling by cross-examining job types (slot type) and CPU cache size requirements against historical data to find opportunities for optimization or new approaches to job organization.
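As mentioned in the second use case, a predictive model can be sketched quickly; the following minimal example assumes the column roles described above and may need adjusting to the actual file contents:

```python
# Minimal sketch (assumed file name and column roles): predict job failure from
# the resource metrics in dataset_classification.csv with a Random Forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("dataset_classification.csv")

# Columns named in the description; exact names/types may differ in the file.
numeric = ["TotalCpus", "TotalMemory", "TotalDisk", "CpuCacheSize", "DiskUsage"]
X = pd.get_dummies(df[numeric + ["SlotType", "CpuIsBusy"]], columns=["SlotType", "CpuIsBusy"])
y = df["Failure"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```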
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1....
License: GNU General Public License v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset has gained popularity over time and is widely known. While Kaggle courses teach how to use Google BigQuery to extract a sample from it, this notebook provides a HOW-TO guide to access the dataset directly within your own notebook. Instead of uploading the entire dataset here, which is quite large, I offer several alternatives to work with a smaller portion of it. My main focus was to demonstrate various techniques to make the dataset more manageable on your own laptop, ensuring smoother operations. Additionally, I've included some interesting insights on basic descriptive statistics and even a modeling example, which can be further explored based on your preferences. I intend to revisit and refine it in the near future to enhance its rigor. Meanwhile, I welcome any suggestions to improve the notebook!
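Since the dataset itself is not named here, the following is only a generic, hedged sketch of the BigQuery approach from a Kaggle notebook; the public table is a placeholder to substitute with the dataset in question:

```python
# Minimal sketch: pull a small, manageable sample of a large BigQuery public
# dataset from within a Kaggle notebook. The table below is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT *
    FROM `bigquery-public-data.samples.natality` TABLESAMPLE SYSTEM (1 PERCENT)
    LIMIT 10000
"""
# Cap the bytes the query may scan so an accidental full-table read fails fast.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
sample_df = client.query(query, job_config=job_config).to_dataframe()
print(sample_df.shape)
```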
Here are the columns that I have chosen to include (after carefully eliminating a few others):
Feel free to explore the notebook and provide any suggestions for improvement. Your feedback is highly appreciated!