Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Greetings , fellow analysts !
(NOTE : This is a random dataset generated using python. It bears no resemblance to any real entity in the corporate world. Any resemblance is a matter of coincidence.)
REC-SSEC Bank is a govt-aided bank operating in the Indian Peninsula. They have regional branches in over 40+ regions of the country. You have been provided with a massive excel sheet containing the transaction details, the total transaction amount and their location and total transaction count.
The dataset is described as follows :
For example , in the very first row , the data can be read as : " On the first of January, 2022 , 1932 transactions of summing upto INR 365554 from Bhuj were reported " NOTE : There are about 2750 transactions every single day. All of this has been given to you.
The bank wants you to answer the following questions :
Facebook
TwitterExcel spreadsheets by species (4 letter code is abbreviation for genus and species used in study, year 2010 or 2011 is year data collected, SH indicates data for Science Hub, date is date of file preparation). The data in a file are described in a read me file which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read me description of the column in the date set for chemical analysis. In this file one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk , D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee , and M. Plocher. Plant reproduction is altered by simulated herbicide drift toconstructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).
Facebook
TwitterThe Large Truck* Crash Causation Study (LTCCS) is based on a three-year data collection project conducted by the Federal Motor Carrier Safety Administration (FMCSA) and the National Highway Traffic Safety Administration (NHTSA) of the U.S. Department of Transportation (DOT). LTCCS is the first-ever national study to attempt to determine the critical events and associated factors that contribute to serious large truck crashes allowing DOT and others to implement effective countermeasures to reduce the occurrence and severity of these crashes.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.
The PTB-XL ECG dataset is a large dataset of 21799 clinical 12-lead ECGs from 18869 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.
Facebook
TwitterOpen Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The National Pollutant Release Inventory (NPRI) is Canada's public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling. Each file contains data from 1993 to the latest reporting year. These CSV format datasets are in normalized or ‘list’ format and are optimized for pivot table analyses. Here is a description of each file: - The RELEASES file contains all substance release quantities. - The DISPOSALS file contains all on-site and off-site disposal quantities, including tailings and waste rock (TWR). - The TRANSFERS file contains all quantities transferred for recycling or treatment prior to disposal. - The COMMENTS file contains all the comments provided by facilities about substances included in their report. - The GEO LOCATIONS file contains complete geographic information for all facilities that have reported to the NPRI. Please consult the following resources to enhance your analysis: - Guide on using and Interpreting NPRI Data: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/using-interpreting-data.html - Access additional data from the NPRI, including datasets and mapping products: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/exploredata.html Supplemental Information More NPRI datasets and mapping products are available here: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/access.html Supporting Projects: National Pollutant Release Inventory (NPRI)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Uniform Appraisal Dataset (UAD) Aggregate Statistics Data File and Dashboards are the nation’s first publicly available datasets of aggregate statistics on appraisal records, giving the public new access to a broad set of data points and trends found in appraisal reports. The UAD Aggregate Statistics for Enterprise Single-Family, Enterprise Condominium, and Federal Housing Administration (FHA) Single-Family appraisals may be grouped by neighborhood characteristics, property characteristics and different geographic levels.DocumentationOverview (10/28/2024)Data Dictionary (10/28/2024)Data File Version History and Suppression Rates (12/18/2024)Dashboard Guide (2/3/2025)UAD Aggregate Statistics DashboardsThe UAD Aggregate Statistics Dashboards are the visual front end of the UAD Aggregate Statistics Data File. The Dashboards are designed to provide easy access to customized maps and charts for all levels of users. Access the UAD Aggregate Statistics Dashboards here.UAD Aggregate Statistics DatasetsNotes:Some of the data files are relatively large in size and will not open correctly in certain software packages, such as Microsoft Excel. All the files can be opened and used in data analytics software such as SAS, Python, or R.All CSV files are zipped.
Facebook
TwitterThe documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The sample for Slovenia ES 2009, 2013, 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information of the industries and regions chosen are included in the attached Excel file (Sampling Report.xls.) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information of the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.
For the Slovenia 2009 ES, industry stratification was designed in the way that follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals) sample sizes were inflated by about 12% to account for under sampling in firms in service industries.
For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
Questionnaires have common questions (core module) and respectfully additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (includes the core module, plus manufacturing specific questions). Retail firms have been interviewed using the Services questionnaire (includes the core module plus retail specific questions) and the residual eligible services have been covered using the Services questionnaire (includes the core module). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.
Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in the Slovenia may be selection bias and not frame inaccuracy.
For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.
Finally, for 2019, the number of interviews per contacted establishments was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.
IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scalar.
IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform
The following pages can help you gain the understanding of the intellectual property administration and processes in Australia to help your analysis on the dataset.
Due to the changes in our systems, some tables have been affected.
Data quality has been improved across all tables.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description Dataset of porosity data in Powder Bed Fusion - Laser Beam of Ti-6Al-4V obtained via X-ray Micro Computed Tomography. This work was conducted on an EOS M290. Contents poredf.csv: A csv file with pore measurements for each sample scanned.
parameters.csv: A csv file containing the process parameters and extreme value statistics (EVS) parameters for each sample scanned.
WARNING: parameters.csv is too large to open in excel. Saving it in excel will cause data loss
Facebook
TwitterThese data are daily dust count observations taken in College-Fairbanks, Alaska from 23 March 1933 to 29 August 1933. The data are part of a larger collection titled "Second International Polar Year Records, 1931-1936, Department of Terrestrial Magnetism, Carnegie Institute of Washington." Within this larger collection, the data are identified as "Series 1: College-Fairbanks IPY Station Records and Data, 1932-1934: Subseries C: Auroral and Meteorological Records and Data, 1932-1933: Dust Count Observations, March 1933 - August 1933."The data are provided in a PDF copy of the handwritten entries (Dust_Count_Observations_March1933_to_August1933.pdf). Two supporting files are also included in this data set. The first is a copy of the handwritten data transcribed to a Microsoft Excel spreadsheet (Dust_Count_Observations_March1933_to_August1933.xls). The second is a PDF document that explains the larger collection (DTM_Collection_Description.pdf).The entries were recorded using an Aitken Dust Counter. Each entry includes up to 10 counts per day with measurements of wind, clouds, and visibility. The handwritten copy has the most complete data, as some of the handwritten notes were not transcribed into the computer spreadsheet. For example, handwritten notes concerning problems with the counter itself were not transcribed into the computer spreadsheet.The data are available via FTP. NOAA@NSIDC believes these data to be of value but is unable to provide documentation. If you have information about this data set that others would find useful, please contact NSIDC User Services.
Facebook
Twitter
Facebook
TwitterThis dataset contains all current and active business licenses issued by the Department of Business Affairs and Consumer Protection. This dataset contains a large number of records /rows of data and may not be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or Wordpad, to view and search.
Data fields requiring description are detailed below.
APPLICATION TYPE: 'ISSUE' is the record associated with the initial license application. 'RENEW' is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. 'C_LOC' is a change of location record. It means the business moved. 'C_CAPA' is a change of capacity record. Only a few license types my file this type of application. 'C_EXPA' only applies to businesses that have liquor licenses. It means the business location expanded.
LICENSE STATUS: 'AAI' means the license was issued.
Business license owners may be accessed at: http://data.cityofchicago.org/Community-Economic-Development/Business-Owners/ezma-pppn To identify the owner of a business, you will need the account number or legal name.
Data Owner: Business Affairs and Consumer Protection
Time Period: Current
Frequency: Data is updated daily
Facebook
TwitterThis dataset contains FEMA applicant-level data for the Individuals and Households Program (IHP). All PII information has been removed. The location is represented by county, city, and zip code. This dataset contains Individual Assistance (IA) applications from DR1439 (declared in 2002) to those declared over 30 days ago. The full data set is refreshed on an annual basis and refreshed weekly to update disasters declared in the last 18 months. This dataset includes all major disasters and includes only valid registrants (applied in a declared county, within the registration period, having damage due to the incident and damage within the incident period). Information about individual data elements and descriptions are listed in the metadata information within the dataset.rnValid registrants may be eligible for IA assistance, which is intended to meet basic needs and supplement disaster recovery efforts. IA assistance is not intended to return disaster-damaged property to its pre-disaster condition. Disaster damage to secondary or vacation homes does not qualify for IHP assistance.rnData comes from FEMA's National Emergency Management Information System (NEMIS) with raw, unedited, self-reported content and subject to a small percentage of human error.rnAny financial information is derived from NEMIS and not FEMA's official financial systems. Due to differences in reporting periods, status of obligations and application of business rules, this financial information may differ slightly from official publication on public websites such as usaspending.gov. This dataset is not intended to be used for any official federal reporting. rnCitation: The Agency’s preferred citation for datasets (API usage or file downloads) can be found on the OpenFEMA Terms and Conditions page, Citing Data section: https://www.fema.gov/about/openfema/terms-conditions.rnDue to the size of this file, tools other than a spreadsheet may be required to analyze, visualize, and manipulate the data. MS Excel will not be able to process files this large without data loss. It is recommended that a database (e.g., MS Access, MySQL, PostgreSQL, etc.) be used to store and manipulate data. Other programming tools such as R, Apache Spark, and Python can also be used to analyze and visualize data. Further, basic Linux/Unix tools can be used to manipulate, search, and modify large files.rnIf you have media inquiries about this dataset, please email the FEMA News Desk at FEMA-News-Desk@fema.dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open Government program, please email the OpenFEMA team at OpenFEMA@fema.dhs.gov.rnThis dataset is scheduled to be superceded by Valid Registrations Version 2 by early CY 2024.
Facebook
TwitterThis is a dataset downloaded off excelbianalytics.com created off of random VBA logic. I recently performed an extensive exploratory data analysis on it and I included new columns to it, namely: Unit margin, Order year, Order month, Order weekday and Order_Ship_Days which I think can help with analysis on the data. I shared it because I thought it was a great dataset to practice analytical processes on for newbies like myself.
Facebook
TwitterThis dataset contains all current and active business licenses issued by the Department of Business Affairs and Consumer Protection. This dataset contains a large number of records /rows of data and may not be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or Wordpad, to view and search.
Data fields requiring description are detailed below.
APPLICATION TYPE: 'ISSUE' is the record associated with the initial license application. 'RENEW' is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. 'C_LOC' is a change of location record. It means the business moved. 'C_CAPA' is a change of capacity record. Only a few license types my file this type of application. 'C_EXPA' only applies to businesses that have liquor licenses. It means the business location expanded.
LICENSE STATUS: 'AAI' means the license was issued.
Business license owners may be accessed at: http://data.cityofchicago.org/Community-Economic-Development/Business-Owners/ezma-pppn To identify the owner of a business, you will need the account number or legal name.
Data Owner: Business Affairs and Consumer Protection
Time Period: Current
Frequency: Data is updated daily
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Description: The nationwide public job recruitment data, including job title, job experience requirements, academic requirements, industry, job type, the nature of the company and other fields; Time range: 2017-01-01 to 2017-10-31; Data volume: 40,000 (randomly selected in the time range), sources include major recruitment sites, corporate website, job BBS; Data Format: excel.
Facebook
TwitterThe heat pump monitoring datasets are a key output of the Electrification of Heat Demonstration (EoH) project, a government-funded heat pump trial assessing the feasibility of heat pumps across the UK’s diverse housing stock. These datasets are provided in both cleansed and raw form and allow analysis of the initial performance of the heat pumps installed in the trial. From the datasets, insights such as heat pump seasonal performance factor (a measure of the heat pump's efficiency), heat pump performance during the coldest day of the year, and half-hourly performance to inform peak demand can be gleaned.
For the second edition (December 2024), the data were updated to include cleaned performance data collected between November 2020 and September 2023. The only documentation currently available with the study is the Excel data dictionary. Reports and other contextual information can be found on the Energy Systems Catapult website.
The EoH project was funded by the Department of Business, Energy and Industrial Strategy. From 2023, it is covered by the new Department for Energy Security and Net Zero.
Data availability
This study comprises the open-access cleansed data from the EoH project and a summary dataset, available in four zipped files (see the 'Access Data' tab). Users must download all four zip files to obtain the full set of cleansed data and accompanying documentation.
When unzipped, the full cleansed data comprises 742 CSV files. Most of the individual CSV files are too large to open in Excel. Users should ensure they have sufficient computing facilities to analyse the data.
The UKDS also holds an accompanying study, SN 9049 Electrification of Heat Demonstration Project: Heat Pump Performance Raw Data, 2020-2023, which is available only to registered UKDS users. This contains the raw data from the EoH project. Since the data are very large, only the summary dataset is available to download; an order must be placed for FTP delivery of the remaining raw data. Other studies in the set include SN 9209, which comprises 30-minute interval heat pump performance data, and SN 9210, which includes daily heat pump performance data.
The Python code used to cleanse the raw data and then perform the analysis is accessible via the
"https://github.com/ES-Catapult/electrification_of_heat" target="_blank">
Energy Systems Catapult Github
Facebook
TwitterMonthly archive of all parking meter sensor activity over the previous 36 months (3 years). Updated monthly for data 2 months prior (eg. January data will be published early March). For best-available current "live" status, see "LADOT Parking Meter Occupancy". For location and parking policy details, see "LADOT Metered Parking Inventory & Policies". This dataset is geared towards database professionals and/or app developers. Each file is extremely large, over 300MB at minimum. Common applications like Microsoft Excel will not be able to open the file and show all data. ** For best results, import into a database or use advanced data access methods appropriate for processing large files.
Facebook
TwitterBusiness licenses issued by the Department of Business Affairs and Consumer Protection in the City of Chicago from 2002 to the present. This dataset contains a large number of records/rows of data and may not be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or Wordpad, to view and search.
Data fields requiring description are detailed below.
APPLICATION TYPE: ‘ISSUE’ is the record associated with the initial license application. ‘RENEW’ is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. ‘C_LOC’ is a change of location record. It means the business moved. ‘C_CAPA’ is a change of capacity record. Only a few license types may file this type of application. ‘C_EXPA’ only applies to businesses that have liquor licenses. It means the business location expanded. 'C_SBA' is a change of business activity record. It means that a new business activity was added or an existing business activity was marked as expired.
LICENSE STATUS: ‘AAI’ means the license was issued. ‘AAC’ means the license was cancelled during its term. ‘REV’ means the license was revoked. 'REA' means the license revocation has been appealed.
LICENSE STATUS CHANGE DATE: This date corresponds to the date a license was cancelled (AAC), revoked (REV) or appealed (REA).
Business License Owner information may be accessed at: https://data.cityofchicago.org/dataset/Business-Owners/ezma-pppn. To identify the owner of a business, you will need the account number or legal name, which may be obtained from this Business Licenses dataset.
Data Owner: Business Affairs and Consumer Protection. Time Period: January 1, 2002 to present. Frequency: Data is updated daily.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes open-source projects written in C# programing language, annotated for the presence of Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:
Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.
The dataset contains two excel datasheets:
DataSet_Large Class.xlsx – C# classes annotated for the Large Class code smell severity.
DataSet_Long Method.xlsx – C# methods annotated for the Long method code smell severity.
The columns in the datasheet represent:
Code Snippet ID – the full name of the code snippet.
For classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).
For methods, this is the full name of the class and the methods’s signature (e.g., namespace.class.method(param1Type, param2Type) ).
Link – The GitHub link to the code snippet, including the commit and the start and end LOC.
Code Smell – code smell for which the code snippet is examined (Large Class or Long Method).
Project Link – the link to the version of the code repository that was annotated.
Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 25 class-level metrics for Large Class detection and 18 method-level metrics for Long Method detection The list of metrics and their definitions is available here.
Final annotation – a single severity score calculated by a majority vote.
Annotators – each annotator's (1, 2, or 3) assigned severity score.
To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate excel datasheets:
LargeClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Large Class code smell.
LongMethod_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Large Class code smell.
The columns of these two datasheets are:
Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Large Class.xlsx and DataSet_Long Method.xlsx)
Annotators – heuristics labelled by each of the annotators (1, 2, or 3).
Heuristics – whether the heuristic is applicable to the examined code snippet or not (Section 1.2.4 lists heuristics relevant for the Large Class detection, and Section 1.2.5 lists the heuristics relevant for the Long Method detection).
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Greetings , fellow analysts !
(NOTE : This is a random dataset generated using python. It bears no resemblance to any real entity in the corporate world. Any resemblance is a matter of coincidence.)
REC-SSEC Bank is a govt-aided bank operating in the Indian Peninsula. They have regional branches in over 40+ regions of the country. You have been provided with a massive excel sheet containing the transaction details, the total transaction amount and their location and total transaction count.
The dataset is described as follows :
For example , in the very first row , the data can be read as : " On the first of January, 2022 , 1932 transactions of summing upto INR 365554 from Bhuj were reported " NOTE : There are about 2750 transactions every single day. All of this has been given to you.
The bank wants you to answer the following questions :