Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper has been submitted to the MSR 2022 Data Showcase Track.
The datasets are available under directory dataset. There are 4 datasets in this directory.
In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found here.
The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub Search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.
More specifically, git_token handles the GitHub API token that is necessary for requests to the GitHub API. The collector script performs the GitHub search. Tracing changed lines and git annotate is done in gitminer using PyDriller. Finally, gumtree applies 4 filtering steps (number of lines, number of files, language, and change significance).
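For orientation, commit collection with PyDriller generally follows the pattern below (a minimal sketch, not the exact gitminer code; the repository URL is illustrative):

from pydriller import Repository

# Traverse the commits of a repository and inspect the changed files
for commit in Repository("https://github.com/apache/zookeeper").traverse_commits():
    print(commit.hash, commit.msg.splitlines()[0])
    for mf in commit.modified_files:  # PyDriller 2.x attribute names
        print("  ", mf.filename, mf.added_lines, mf.deleted_lines)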
References:
1. GumTree
Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Västerås, Sweden, September 15-19, 2014. 313–324.
2. PyDriller
Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018), Lake Buena Vista, FL, USA. Association for Computing Machinery, New York, NY, USA, 908–911.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rising trend of scientific research has drawn more people's attention to it, but the phrase "scientific research" alone does not capture its whole nature; like anything else, it is divided into many realms. The various fields of scientific research have been discussed in many scholarly articles and evaluated in previous censuses and studies. The ultimate question, however, remains unanswered: what is the most popular field of scientific research, and which one will become the focus in the future? Although the number of specific fields is too vast to count, several major fields can be identified to categorize them, such as astronomy, engineering, computer science, medicine, biology, and chemistry. Several main factors relate to popularity, such as the number of articles in each field, the number of posts on social media, and the number of views on professional sites. A program was developed to analyze the relationship between the subjects of scientific research and their future trends, based on the number of mentions of each field, scholarly articles, and quotations about them. The program uses data from Altmetric, an authoritative data source. SAS is used to analyze the data and plot it on several graphs that represent the value of each factor. Finally, suggestions for future scientific research can be summarized and inferred from the results of this research, which aims to provide guidance for future research directions.
Fig 1 - The functions used in this research.
Fig 2 - The main Python program used in this research.
Fig 3 - The structure of output.
Fig 4 - Factor 1: Number of articles relating to each field.
Fig 5 - Factor 2: Number of views on Mendeley, Connotea, and Citeulike.
Fig 6 - Factor 3: Number of posts on Facebook and Twitter.
Fig 7 - The correlation between individual factors.
Dataset Card for Python-DPO
This dataset is the larger version of the Python-DPO dataset and has been created using Argilla.
Load with datasets
To load this dataset with the datasets library, install it with pip install datasets --upgrade and then use the following code:
from datasets import load_dataset
ds = load_dataset("NextWealth/Python-DPO")
Data Fields
Each data instance contains:
instruction: The problem description/requirements
chosen_code: … See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO-Large.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets, together with diverse linked data, are difficult to explore when provided in flat files. Here we provide a way to filter and analyse, in a systematic way, a dataset with more than 18 thousand data points using Zegami, a solution for interactive data visualisation and exploration. The primary data are derived from "A systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system", which has been submitted elsewhere. This manual provides the raw image data together with annotations and associated data, and explains how to use Zegami to explore all these data types together through specific examples. We also provide the open-source Python code used to annotate the figures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are demo data files used to teach machine learning with Python in course 3011979 at Chulalongkorn University in Spring 2021 and Spring 2022.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).
Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
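For instance, the tables can be joined with pandas along the keys described above (a sketch assuming the column names given in this description):

import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

# snippets -> kernels via kernel_id, kernels -> competitions via comp_name
df = (code_blocks
      .merge(kernels, on="kernel_id")
      .merge(competitions, on="comp_name"))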
The updated Code4ML 2.0 corpus includes kernels retrieved from the Meta Kaggle Code dataset. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).
Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research data file contains the necessary software and the dataset for estimating the missing prices of house units. The approach combines several machine learning techniques (linear regression, support vector regression, k-nearest neighbors, and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative matrix factorization, recursive feature elimination, and feature selection with a variance threshold). It includes the input dataset formed from the house prices available in two neighborhoods of Teruel city (Spain) on 13 November 2017 from the Idealista website. These two neighborhoods are the city center and "Ensanche".
This dataset supports the authors' research on improving the setup of agent-based simulations of the real-estate market. The work based on this dataset has been submitted for consideration for publication to a scientific journal.
The open-source Python code comprises all the files with the ".py" extension. The main program can be executed from the "main.py" file. The "boxplotErrors.eps" file is a chart generated by executing the code; it compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.
The dataset is in the "data" folder. The raw input data of the house prices are in the "dataRaw.csv" file. These were shuffled into the "dataShuffled.csv" file. We used cross-validation to obtain the estimations of house prices. The output estimations, alongside the real values, are stored in different files of the "data" folder, where each filename is composed of the abbreviations of the machine learning technique and the dimensionality reduction method.
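As an illustration of the kind of pipeline evaluated here, one combination of a dimensionality reduction step and a machine learning technique can be cross-validated with scikit-learn as follows (a sketch with synthetic placeholder data, not the authors' exact configuration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 10))      # placeholder house features
y = rng.random(100) * 100000   # placeholder prices

model = make_pipeline(VarianceThreshold(threshold=0.01), KNeighborsRegressor())
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())  # mean absolute error across the folds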
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Shall we play together? The mugunghwa flower has bloomed 😜 This is a dataset for preparing for the practical exam of the Big Data Analytics Engineer (빅데이터 분석기사) certification. If you come up with better code, please share it 🎉 (Both Python and R are welcome.)
Classification (advanced variation of a 3rd-exam question): https://www.kaggle.com/code/agileteam/3rd-type2-3-2-baseline
Task Type 1 (3rd-exam question types)
Task Type 1 mock exam 2 (advanced): https://www.kaggle.com/code/agileteam/mock-exam2-type1-1-2
Check the problems and code in the Tasks tab
[2nd-exam question type] Task Type 1 P: https://www.kaggle.com/agileteam/tutorial-t1-2-python R: https://www.kaggle.com/limmyoungjin/tutorial-t1-2-r-2
Official sample questions (Task Type 1) P: https://www.kaggle.com/agileteam/tutorial-t1-python R: https://www.kaggle.com/limmyoungjin/tutorial-t1-r
T1-1. Outlier (IQR) / #outliers #IQR P: https://www.kaggle.com/agileteam/py-t1-1-iqr-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-1-iqr-expected-questions-2
T1-2. Outlier (age) / #outliers #decimal-age P: https://www.kaggle.com/agileteam/py-t1-2-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-2-expected-questions-2
T1-3. Missing data / #missing-values #drop #median #mean P: https://www.kaggle.com/agileteam/py-t1-3-map-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-3-expected-questions-2
T1-4. Skewness and Kurtosis (Log Scale) / #skewness #kurtosis #log-scale P: https://www.kaggle.com/agileteam/py-t1-4-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-4-expected-questions-2
T1-5. Standard deviation / #standard-deviation P: https://www.kaggle.com/agileteam/py-t1-5-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-5-expected-questions-2
T1-6. Groupby Sum / #missing-values #condition P: https://www.kaggle.com/agileteam/py-t1-6-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-6-expected-questions-2
T1-7. Replace / #value-replacement #condition #max P: https://www.kaggle.com/agileteam/py-t1-7-2-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-7-2-expected-questions-2
T1-8. Cumulative Sum / #cumulative-sum #missing-values #interpolation P: https://www.kaggle.com/agileteam/py-t1-8-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-8-expected-questions-2
T1-9. Standardization / #standardization #median P: https://www.kaggle.com/agileteam/py-t1-9-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-9-expected-questions-2
T1-10. Yeo-Johnson and Box–Cox / #yeo-johnson #box-cox #missing-values #mode P: https://www.kaggle.com/agileteam/py-t1-10-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-10-expected-questions-2
T1-11. Min-max scaling / #scaling #top-bottom-values P: https://www.kaggle.com/agileteam/py-t1-11-min-max-5-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-11-min-max-5-expected-questions-2
T1-12. top10-bottom10 / #grouping #sorting #top-bottom-values P: https://www.kaggle.com/agileteam/py-t1-12-10-10-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-12-10-expected-questions-2
T1-13. Correlation / #correlation P: https://www.kaggle.com/agileteam/py-t1-13-expected-questions R: https://www.kaggle.com/limmyoungjin/r-t1-13-expected-questions-2
T1-14. Multi Index & Groupby / #multi-index #sorting #index-reset #top-values P: https://www.kaggle.com/agileteam/py-t1-14-2-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-14-2-expected-question-2
T1-15. Slicing & Condition / #slicing #missing-values #median #condition P: https://www.kaggle.com/agileteam/py-t1-15-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-15-expected-question-2
T1-16. Variance / #variance #difference-around-missing-values P: https://www.kaggle.com/agileteam/py-t1-16-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-16-expected-question-2
T1-17. Time-Series1 / #time-series #datetime P: https://www.kaggle.com/agileteam/py-t1-17-1-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-17-1-expected-question-2
T1-18. Time-Series2 / #weekend #weekday #comparison #time-series P: https://www.kaggle.com/agileteam/py-t1-18-2-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-18-2-expected-question-2
T1-19. Time-Series3 (monthly total) / #monthly #totals #comparison #value-changes
P: https://www.kaggle.com/agileteam/py-t1-19-3-expected-question
R: https://www.kaggle.com/limmyoungjin/r-t1-19-3-expected-question-2
T1-20. Combining Data / #merge #join / matching customers with their best-suited type
P: https://www.kaggle.com/agileteam/py-t1-20-expected-question
R: https://www.kaggle.com/limmyoungjin/r-t1-20-expected-question-2
T1-21. Binning Data / #binning #bucketing P: https://www.kaggle.com/agileteam/py-t1-21-expected-question R: https://www.kaggle.com/limmyoungjin/r-t1-21-expected-question-2
T1-22. Time-Series4 (Weekly data) / #weekly #sum P: https://www.kaggle.com/agileteam/t1-22-time-series4-weekly-data R: https://www.kaggle.com/limmyoungjin/r-t1-22-time-series4-weekly-data-2
T1-23. Drop Duplicates / #deduplication #missing-values #fill-with-10th-value P: https://www.kaggle.com/agileteam/t1-23-drop-duplicates R: https://www.kaggle.com/limmyoungjin/r-t1-23-drop-duplicates-2
T1-24. Time-Series5 (Lagged Feature) / #lagged-data #condition P: https://www.kaggle.com/agileteam/t1-24-time-series5-lagged-feature R: https://www.kaggle.com/limmyoungjin/r-t1-24-time-series5-2
[MOCK EXAM1] TYPE1 / Task Type 1 mock exam P: https://www.kaggle.com/agileteam/mock-exam1-type1-1-tutorial R: https://www.kaggle.com/limmyoungjin/mock-exam1-type1-1
[MOCK EXAM2] TYPE1 / Task Type 1 mock exam 2 P: https://www.kaggle.com/code/agileteam/mock-exam2-type1-1-2
Check the problems and code in the Tasks tab - [3rd-exam question type, Task Type 2]: travel insurance package product (the data was modified to be a bit harder) P: https://www.kaggle.com/code/agileteam/3rd-type2-3-2-baseline
[2nd-exam question type, Task Type 2]: E-Commerce Shipping Data P: https://www.kaggle.com/agileteam/tutorial-t2-2-python R: https://www.kaggle.com/limmyoungjin/tutorial-t2-2-r
T2. Exercise / Sample question: one year of department store customer data (official dataq example) P: https://www.kaggle.com/agileteam/t2-exercise-tutorial-baseline
T2-1. Titanic (Classification) P: https://www.kaggle.com/agileteam/t2-1-titanic-simple-baseline R: https://www.kaggle.com/limmyoungjin/r-t2-1-titanic
T2-2. Pima Indians Diabetes (Classification) P: https://www.kaggle.com/agileteam/t2-2-pima-indians-diabetes R: https://www.kaggle.com/limmyoungjin/r-t2-2-pima-indians-diabetes
T2-3. Adult Census Income (Classification) P: https://www.kaggle.com/agileteam/t2-3-adult-census-income-tutorial R: https://www.kaggle.com/limmyoungjin/r-t2-3-adult-census-income
T2-4. House Prices (Regression) / RMSE P: https://www.kaggle.com/code/blighpark/t2-4-house-prices-regression R: https://www.kaggle.com/limmyoungjin/r-t2-4-house-prices
T2-5. Insurance Forecast (Regression) P: https://www.kaggle.com/agileteam/insurance-starter-tutorial R: https://www.kaggle.com/limmyoungjin/r-t2-5-insurance-prediction
T2-6. Bike-sharing-demand (Regression) / RMSLE P: R: https://www.kaggle.com/limmyoungjin/r-t2-6-bike-sharing-demand
[MOCK EXAM1] TYPE2. HR-DATA / Task Type 2 mock exam P: https://www.kaggle.com/agileteam/mock-exam-t2-exam-template (template only) https://www.kaggle.com/agileteam/mock-exam-t2-starter-tutorial (for absolute beginners) https://www.kaggle.com/agileteam/mock-exam-t2-baseline-tutorial (baseline)
Weeks before the exam | Type (editor) | Numbers |
---|---|---|
6 weeks | Task Type 1 (notebook) | T1-1~5 |
5 weeks | Task Type 1 (notebook) | T1-6~9, T1 EQ (past exam) |
4 weeks | Task Type 1 (script), Task Type 2 (notebook) | T1-10~13, T1.Ex, T2EQ, T2-1 |
3 weeks | Task Type 1 (script), Task Type 2 (notebook) | T1-14~19, T2-2~3 |
2 weeks | Task Type 1 (script), Task Type 2 (script) | T1-20~21, T2-4~6, review |
1 week | Task Type 1, Task Type 2 (script), short answer | T1-22~24, mock exams, review, exam environment trial, short answers |
https://www.marketresearchforecast.com/privacy-policy
The Python Package Software market is experiencing robust growth, driven by the increasing adoption of Python in various industries and the rising demand for efficient and specialized software solutions. The market's expansion is fueled by the large and active Python community constantly developing and refining packages for diverse applications, from web development and data science to machine learning and automation. While precise market sizing is unavailable, considering the widespread use of Python and the significant contribution of open-source packages, a reasonable estimate for the 2025 market size could be around $5 billion, projecting a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033). This growth is primarily driven by the increasing complexity of software projects demanding specialized functionality readily available through packages, the need for faster development cycles, and the cost-effectiveness of leveraging pre-built components. Key trends include the rise of cloud-based Python package management, the growing importance of security and maintainability in package selection, and the increasing specialization of packages for niche applications. Constraints on market growth might include challenges in ensuring package quality and security, as well as the learning curve associated with integrating and managing diverse packages within large projects. The market is segmented into cloud-based and web-based solutions, catering to large enterprises and SMEs, with North America and Europe currently holding the largest market shares.
The diverse range of packages, from those focusing on data manipulation (Pandas, NumPy) and web frameworks (Django, Flask) to machine learning libraries (Scikit-learn, TensorFlow) and GUI development (Tkinter, PyQt), underscores the market's versatility. The significant contribution of open-source packages fosters a collaborative environment and continuous improvement. However, challenges remain in effectively managing the vast ecosystem of packages, addressing security vulnerabilities, and ensuring interoperability. Future growth will hinge on addressing these challenges, fostering standardization, and further improving the accessibility and user experience of Python package management systems. Continued innovation within the Python ecosystem and broader industry trends such as the rise of AI and big data will further propel the market's expansion.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GlobalHighPM2.5 is part of a series of long-term, seamless, global, high-resolution, and high-quality datasets of air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data sources (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, taking into account the spatiotemporal heterogeneity of air pollution.
This dataset contains input data, analysis codes, and generated dataset used for the following article. If you use the GlobalHighPM2.5 dataset in your scientific research, please cite the following reference (Wei et al., NC, 2023):
Wei, J., Li, Z., Lyapustin, A., Wang, J., Dubovik, O., Schwartz, J., Sun, L., Li, C., Liu, S., and Zhu, T. First close insight into global daily gapless 1 km PM2.5 pollution, variability, and health impact. Nature Communications, 2023, 14, 8349. https://doi.org/10.1038/s41467-023-43862-3
Input Data
Relevant raw data for each figure (compiled into a single sheet within an Excel document) in the manuscript.
Code
Relevant Python scripts for replicating and plotting the analysis results in the manuscript, as well as code for converting data formats.
Generated Dataset
This is the first big-data-derived seamless (spatial coverage = 100%) daily, monthly, and yearly 1 km (i.e., D1K, M1K, and Y1K) global ground-level PM2.5 dataset over land from 2017 to the present. This dataset exhibits high quality, with cross-validation coefficients of determination (CV-R2) of 0.91, 0.97, and 0.98, and root-mean-square errors (RMSEs) of 9.20, 4.15, and 2.77 µg m⁻³ on the daily, monthly, and annual bases, respectively.
Due to data volume limitations,
all (including daily) data for the year 2022 is accessible at: GlobalHighPM2.5 (2022)
all (including daily) data for the year 2021 is accessible at: GlobalHighPM2.5 (2021)
all (including daily) data for the year 2020 is accessible at: GlobalHighPM2.5 (2020)
all (including daily) data for the year 2019 is accessible at: GlobalHighPM2.5 (2019)
all (including daily) data for the year 2018 is accessible at: GlobalHighPM2.5 (2018)
all (including daily) data for the year 2017 is accessible at: GlobalHighPM2.5 (2017)
continuously updated...
More GHAP datasets for different air pollutants are available at: https://weijing-rs.github.io/product.html
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio-based xarray extension; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4, and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
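The GeoTIFF-to-NetCDF step can be sketched as follows (the file names and attribute values are illustrative, not taken from these resources):

import rioxarray  # rasterio-backed xarray extension

da = rioxarray.open_rasterio("state_les.tif")  # load the GeoTIFF as an xarray DataArray
da = da.squeeze("band", drop=True)             # drop the single-band dimension
da.attrs["title"] = "State-scale LES dataset"  # add descriptive metadata
da.to_netcdf("state_les.nc")                   # write out as NetCDF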
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries. Building on the HydroShare repository's established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS's National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types that integrate with existing visualization, analysis, and data science capabilities available in Python, and then writing analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden for scientists associated with creating a computational environment for executing analyses by installing and maintaining the packages within CUAHSI's HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows. The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple macOS, or Linux from the Python Package Index using the pip utility. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available on GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
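As a rough sketch of how the two packages fit together (the resource id, file path, and gauge number below are illustrative placeholders):

from hsclient import HydroShare
from dataretrieval import nwis

# Sign in to HydroShare and fetch a file from a (hypothetical) resource
hs = HydroShare(username="your_user", password="your_password")
res = hs.resource("abcdef1234567890abcdef1234567890")
res.file_download("data/site_list.csv")

# Retrieve daily-value streamflow for a (hypothetical) USGS gauge from NWIS
df = nwis.get_record(sites="03339000", service="dv", start="2020-01-01", end="2020-12-31")
print(df.head())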
This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific and management challenges in the water domain require synthesis of diverse data. Many data analysis tasks are difficult because datasets are large and complex; standard data formats are not always agreed upon or mapped to efficient structures for analysis; scientists may lack training for tackling large and complex datasets; and it can be difficult to share, collaborate around, and reproduce scientific work. Overcoming barriers to accessing, organizing, and preparing datasets for analyses can transform the way water scientists work. Building on the HydroShare repository’s cyberinfrastructure, we have advanced two Python packages that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS) (i.e., a Python equivalent of USGS’ R dataRetrieval package), loading data into performant structures that integrate with existing visualization, analysis, and data science capabilities available in Python, and writing analysis results back to HydroShare for sharing and publication. While these Python packages can be installed for use within any Python environment, we will demonstrate how the technical burden for scientists associated with creating a computational environment for executing analyses can be reduced and how sharing and reproducibility of analyses can be enhanced through the use of these packages within CUAHSI’s HydroShare-linked JupyterHub server.
This HydroShare resource includes all of the materials presented in a workshop at the 2023 CUAHSI Biennial Colloquium.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is designed for skill gap analysis, focusing on evaluating the gap between students' current skills and industry requirements. It provides insights into technical skills, soft skills, career interests, and challenges, helping to identify areas for improvement.
By leveraging this dataset, educators, recruiters, and researchers can assess students' job readiness and tailor training programs accordingly. It serves as a valuable resource for identifying skill deficiencies, improving career guidance, and enhancing curriculum design.
The column descriptors are as follows:
Name - Student's full name.
email_id - Student's email address.
Year - The academic year the student is currently in (e.g., 1st Year, 2nd Year, etc.).
Current Course - The course the student is currently pursuing (e.g., B.Tech CSE, MBA, etc.).
Technical Skills - List of technical skills possessed by the student (e.g., Python, Data Analysis, Cloud Computing).
Programming Languages - Programming languages known by the student (e.g., Python, Java, C++).
Rating - Self-assessed rating of technical skills on a scale of 1 to 5.
Soft Skills - List of soft skills (e.g., Communication, Leadership, Teamwork).
Rating - Self-assessed rating of soft skills on a scale of 1 to 5.
Projects - Indicates whether the student has worked on any projects (Yes/No).
Career Interest - The student's preferred career path (e.g., Data Scientist, Software Engineer).
Challenges - Challenges faced while applying for jobs/internships (e.g., Lack of experience, Resume building issues).
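A first look at the data with pandas might start like this (a sketch; the CSV file name is a placeholder, and note that the two self-assessment columns share the header "Rating", which pandas disambiguates on load, e.g. as "Rating" and "Rating.1"):

import pandas as pd

df = pd.read_csv("skill_gap_dataset.csv")  # placeholder file name
print(df.columns.tolist())                 # check how the duplicate headers were renamed
print(df["Career Interest"].value_counts().head())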
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data.
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
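You can verify the connection string before running the notebooks (a minimal sketch using SQLAlchemy):

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["JUP_DB_CONNECTION"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the database is reachable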
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute Python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first should mount the folder that stores the repositories; the second should unmount it. You can leave the scripts blank, but that is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the Julia code package for the Bayesian SVM algorithm described in the ECML PKDD 2017 paper, Wenzel et al.: Bayesian Nonlinear Support Vector Machines for Big Data. Files are provided in .jl format, containing Julia language code; Julia is a high-performance dynamic programming language for numerical computing. These files can be opened with openly available text editing software. To run the code, please see the description below or the more detailed wiki.
BSVM.jl - contains the module to run the Bayesian SVM algorithm.
AFKMC2.jl - file for the Assumption Free K-MC2 algorithm (KMeans).
KernelFunctions.jl - module for the kernel type.
DataAccess.jl - module for either generating data or exporting from an existing dataset.
run_test.jl and paper_experiments.jl - modules to run on a file and compute accuracy on an n-fold cross-validation, as well as the Brier score and the log score.
test_functions.jl and paper_experiment_functions.jl - sets of data types and functions for efficient testing.
ECM.jl - module for expectation conditional maximization (ECM) for the nonlinear Bayesian SVM.
For datasets used in the related experiments please see https://doi.org/10.6084/m9.figshare.5443621
Requirements
The BayesianSVM package only works for versions of Julia > 0.5. Other necessary packages will be added automatically during installation. It is also possible to run the package from Python; to do so please check PyJulia. If you prefer to use R, you can use RJulia. All of these are a bit technical due to the fact that Julia is still a young language.
Installation
To install the latest version of the package in Julia, run
Pkg.clone("git://github.com/theogf/BayesianSVM.jl.git")
Running the Algorithm
Here are the basic steps for using the algorithm:
using BayesianSVM
Model = BSVM(X_training,y_training)
Model.Train()
y_predic = sign(Model.Predict(X_test))
y_uncertaintypredic = Model.PredictProb(X_test)
Where X_training should be a matrix of size NSamples x NFeatures, and y_training should be a vector of 1 and -1. You can find a more complete description in the wiki.
Background
We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.
Please also check out our GitHub repository: github.com/theogf/BayesianSVM.jl
https://dataintelo.com/privacy-and-policy
The Python Compiler market size was valued at USD 341.2 million in 2023, and it is projected to grow at a robust compound annual growth rate (CAGR) of 10.8% to reach USD 842.6 million by 2032. The market growth is primarily driven by the increasing adoption of Python as a programming language across various industries, due to its simplicity and versatility.
The growth of the Python Compiler market can be attributed to several key factors. Firstly, the rising prominence of Python in data science, machine learning, and artificial intelligence domains is a significant driver. Python’s extensive libraries and frameworks make it an ideal choice for data processing and algorithm development, leading to increased demand for efficient Python compilers. This widespread application is spurring investments and advancements in compiler technologies to support increasingly complex computational tasks. Additionally, the open-source nature of Python encourages innovation and customization, further fueling market expansion.
Secondly, the educational sector's growing emphasis on coding and computer science education is another pivotal growth factor. Python is often chosen as the introductory programming language in educational institutions due to its readability and straightforward syntax. This trend is creating a steady demand for Python compilers that are user-friendly and suitable for educational purposes. As more schools and universities integrate Python into their curriculums, the market for Python compilers is expected to grow correspondingly, supporting a new generation of programmers and developers.
Furthermore, the increasing adoption of Python by small and medium enterprises (SMEs) is propelling the market forward. SMEs are leveraging Python for various applications, including web development, automation, and data analysis, due to its cost-effectiveness and ease of use. Python’s versatility allows businesses to streamline their operations and develop robust solutions without significant financial investment. This has led to a burgeoning demand for both on-premises and cloud-based Python compilers that can cater to the diverse needs of SMEs across different sectors.
Regionally, the Python Compiler market is witnessing notable growth in North America and the Asia Pacific. North America remains a key market due to the early adoption of advanced technologies and a strong presence of tech giants and startups alike. In contrast, the Asia Pacific region is experiencing rapid growth thanks to its expanding technological infrastructure and burgeoning IT industry. Countries like India and China are emerging as significant players due to their large pool of skilled developers and increasing investment in tech education and innovation.
In the Python Compiler market, the component segment is divided into software and services. The software segment encompasses the actual compiler tools and integrated development environments (IDEs) that developers use to write and optimize Python code. This segment is crucial as it directly impacts the efficiency and performance of Python applications. The demand for advanced compiler software is on the rise due to the need for high-performance computing in areas like machine learning, artificial intelligence, and big data analytics. Enhanced features such as real-time error detection, optimization techniques, and seamless integration with other development tools are driving the adoption of sophisticated Python compiler software.
The services segment includes support, maintenance, consulting, and training services associated with Python compilers. As organizations increasingly adopt Python for critical applications, the need for professional services to ensure optimal performance and scalability is growing. Consulting services help businesses customize and optimize their Python environments to meet specific needs, while training services are essential for upskilling employees and staying competitive in the tech-driven market. Additionally, support and maintenance services ensure that the compilers continue to operate efficiently and securely, minimizing downtime and enhancing productivity.
Within the software sub-segment, integrated development environments (IDEs) like PyCharm, Spyder, and Jupyter Notebooks are gaining traction. These IDEs not only provide robust compiling capabilities but also offer features like debugging, syntax highlighting, and version control, which streamline the development process. The increasing complexity of software develo
This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for the full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al., 2014; Addor et al., 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files for the COINE 2020 paper "Mining International Political Norms from the GDELT Database".
This file contains bilateral government event sequences extracted from the GDELT database for the period 19 June 2018 to 20 June 2019.
Use numpy.load() to load the file.
The code repository can be found here: https://bitbucket.org/SuraviMsc/suravi-msc-python/src/master/
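Loading the file follows the usual NumPy pattern (the file name below is a placeholder; object arrays require allow_pickle=True):

import numpy as np

sequences = np.load("bilateral_event_sequences.npy", allow_pickle=True)
print(type(sequences), getattr(sequences, "shape", None))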