The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling.

The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The working group helped develop the survey that is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly.

From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to all members of their unit, or to collate their unit's responses themselves before reporting in the survey.

Larger storage ranges cover vastly different amounts of data, so the implications could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," the 47 respondents who indicated they had either 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used the actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data are considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response; a sketch of this calculation appears after the resource list below. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:
Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked, with the possible responses. The survey was not administered using this PDF; the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).
Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
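The per-person storage calculation described in the survey methodology above can be sketched as follows. This is a hypothetical illustration only: the field names and numbers are made up, not taken from the actual survey responses.

```python
# Hypothetical sketch of the per-person storage calculation.
# Field names and values are illustrative, not real survey data.

responses = [
    {"high_end_tb": 100, "group_size": 1},   # individual response, reported 10-100 TB range
    {"high_end_tb": 100, "group_size": 12},  # group response covering 12 scientists
]

def per_person_tb(response):
    """High end of the reported range divided by 1 (individual) or by G (group size)."""
    return response["high_end_tb"] / response["group_size"]

for r in responses:
    print(round(per_person_tb(r), 2), "TB per person")
```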
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File name definitions:
'...v_50_175_250_300...' - dataset for velocity ranges [50, 175] + [250, 300] m/s
'...v_175_250...' - dataset for velocity range [175, 250] m/s
'ANNdevelop...' - used to perform 9 parametric sub-analyses where, in each one, many ANNs are developed (trained, validated and tested) and the one yielding the best results is selected
'ANNtest...' - used to test the best ANN from each aforementioned parametric sub-analysis, aiming to find the best ANN model; this dataset includes the 'ANNdevelop...' counterpart
Where to find the input (independent) and target (dependent) variable values for each dataset/Excel file?
input values in 'IN' sheet
target values in 'TARGET' sheet
Where to find the results from the best ANN model (for each target/output variable and each velocity range)?
open the corresponding Excel file; the expected (target) vs ANN (output) results are written in the 'TARGET vs OUTPUT' sheet
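As a convenience, a minimal sketch of reading those sheets with pandas is shown below. The workbook name is a placeholder following the naming convention above, and pandas with openpyxl is assumed to be available.

```python
# Minimal sketch, assuming pandas (with openpyxl) is installed.
# The workbook name below is a placeholder following the file-name conventions above.
import pandas as pd

workbook = "ANNtest_v_175_250.xlsx"  # hypothetical file name

inputs = pd.read_excel(workbook, sheet_name="IN")       # independent variables
targets = pd.read_excel(workbook, sheet_name="TARGET")  # dependent variables

# The best-ANN results (expected vs ANN output) are in the 'TARGET vs OUTPUT' sheet.
results = pd.read_excel(workbook, sheet_name="TARGET vs OUTPUT")

print(inputs.shape, targets.shape, results.shape)
```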
Check reference below (to be added when the paper is published)
https://www.researchgate.net/publication/328849817_11_Neural_Networks_-_Max_Disp_-_Railway_Beams
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
PROJECT OBJECTIVE
We are part of XYZ Co Pvt Ltd, a company in the business of organizing sports events at the international level. Countries nominate sportsmen from different departments, and our team has been given the responsibility to systematize the membership roster and generate different reports as per business requirements.
Questions (KPIs)
TASK 1: STANDARDIZING THE DATASET
TASK 2: DATA FORMATTING
TASK 3: SUMMARIZE DATA - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1) • Create a PIVOT table in the worksheet ANALYSIS, starting at cell B3, with the following details:
TASK 4: SUMMARIZE DATA - EXCEL FUNCTIONS (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a SUMMARY table in the worksheet ANALYSIS, starting at cell G4, with the following details:
TASK 5: GENERATE REPORT - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a PIVOT table report in the worksheet REPORT, starting at cell A3, with the following information:
Process
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The open repository consists of two folders: Dataset and Picture. The Dataset folder contains the file “AWS Dataset Pangandaraan.xlsx”. There are 10 columns, with the first three as time attributes and the others as atmospheric parameters. Each parameter has 8,085 data points, and at the bottom of each column we added summary statistics: the minimum, maximum, and average values.
For further use, the user can select one or more parameters for calculation or analysis. For example, the wind data (speed and direction) can be used to estimate waves with the hindcast method. Furthermore, the user can filter the data using Excel's filter feature to extract an exact time range for analyzing various phenomena considered correlated with atmospheric conditions around Pangandaran, Indonesia.
The second folder, named “Picture,” contains three figures: the monthly distribution of the datasets, the temporal data, and a wind rose.
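For users who prefer to work outside Excel, a minimal time-range filter can be sketched with pandas. The column names here are assumptions and should be replaced with the actual headers in the workbook.

```python
# Minimal sketch, assuming pandas; column names are assumptions. Note that the
# summary rows (minimum, maximum, average) appended at the bottom of each column
# may need to be excluded before any calculation.
import pandas as pd

df = pd.read_excel("AWS Dataset Pangandaraan.xlsx", sheet_name=0)

# Build a timestamp from the (assumed) time-attribute columns and filter a range.
df["timestamp"] = pd.to_datetime(df[["year", "month", "day"]])
subset = df[(df["timestamp"] >= "2021-01-01") & (df["timestamp"] < "2021-02-01")]

print(subset[["wind_speed", "wind_direction"]].describe())  # assumed column names
```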
Context: The University Attendance Sheet Dataset is a comprehensive collection of attendance records from various university courses. This dataset is valuable for analyzing student attendance patterns, studying the impact of attendance on academic performance, and exploring factors influencing student engagement. It provides a rich resource for researchers, educators, and students interested in understanding attendance dynamics within a university setting.
Content: The dataset includes the following information:
Student ID: A unique identifier for each student.
Course ID: A unique identifier for each course.
Date: The date of the attendance record.
Attendance Status: Indicates whether the student was present, absent, or had an excused absence on a particular date.

The dataset contains records from multiple academic semesters, covering a wide range of courses across different disciplines. By examining this dataset, researchers can investigate attendance trends across different courses, identify patterns related to student performance, and explore correlations between attendance and other academic variables.
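One simple analysis the columns above support is a per-student attendance rate. Below is a hypothetical sketch in pandas; the file name and the exact status labels are assumptions.

```python
# Hypothetical sketch: per-student attendance rate from the columns described above.
# File name and status labels are assumptions.
import pandas as pd

df = pd.read_csv("university_attendance.csv")

present = df["Attendance Status"].eq("Present")
attendance_rate = present.groupby(df["Student ID"]).mean()

print(attendance_rate.sort_values(ascending=False).head(10))
```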
Acknowledgements: We would like to express our gratitude to the university administration, faculty members, and students who contributed to the collection and organization of this dataset. Their cooperation and support have made this dataset possible, enabling valuable insights into student attendance dynamics.
Inspiration: The inspiration behind creating this dataset stems from the recognition of the significant role attendance plays in a student's academic journey. By making this dataset available on Kaggle, we hope to facilitate research and analysis on attendance patterns, identify interventions to improve student engagement, and provide educators with valuable insights to enhance their teaching strategies. We also encourage collaboration and exploration of the dataset to uncover new findings and generate knowledge that can benefit the education community as a whole.
By leveraging the University Attendance Sheet Dataset, we aspire to contribute to the ongoing efforts to improve student success and foster an environment that promotes active participation and learning within higher education institutions.
This dataset represents CLIGEN input parameters for locations in 68 countries. CLIGEN is a point-scale stochastic weather generator that produces long-term weather simulations with daily output. The input parameters are essentially monthly climate statistics that also serve as climate benchmarks. Three unique input parameter sets are differentiated by having been produced from 30-year, 20-year and 10-year minimum record lengths, corresponding to 7673, 2336, and 2694 stations, respectively. The primary source of data is the NOAA GHCN-Daily dataset, and due to data gaps, records longer than the three minimum record lengths were often queried to produce the needed number of complete monthly records. The vast majority of stations used at least some data from the 2000s, and temporal coverages are shown in the Excel table for each station. CLIGEN has various applications, including being used to force soil erosion models. This dataset may reduce the effort needed in preparing climate inputs for such applications.

Revised input files added on 11/16/20. These files were revised from the original dataset: fixed metadata issues with the headings of each file, and fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.

Second revision input files added on 2/12/20. A formatting error was fixed that affected transition probabilities for 238 stations with zero recorded precipitation for one or more months. The affected stations were predominantly in Australia and Mexico.

Resources in this dataset:
Resource Title: 30-year input files. File Name: 30-year.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 20-year input files. File Name: 20-year.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 10-year input files. File Name: 10-year.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: Map Layer. File Name: MapLayer.kmz. Resource Description: Map layer showing locations of the new CLIGEN stations. This layer may be imported into Google Earth and used to find the station closest to an area of interest. Resource Software Recommended: Google Earth, url: https://www.google.com/earth/
Resource Title: Temporal Ranges of Years Queried. File Name: GHCN-Daily Year Ranges.xlsx. Resource Description: Excel tables of the first and last years queried from GHCN-Daily when searching for complete monthly records (with no gaps in data). Any ranges in excess of 30 years, 20 years and 10 years, for the respective datasets, are due to data gaps.
Resource Title: 30-year input files (revised). File Name: 30-year revised.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file and inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 20-year input files (revised). File Name: 20-year revised.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file and inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 10-year input files (revised). File Name: 10-year revised.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file and inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 30-year input files (revised 2). File Name: 30-year revised 2.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 20-year input files (revised 2). File Name: 20-year revised 2.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Resource Title: 10-year input files (revised 2). File Name: 10-year revised 2.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the major cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.

A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose here is to collect characteristics of heart attacks, or the factors that contribute to them. A form was created in Microsoft Excel to accomplish this. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. Age, gender, heart rate, systolic blood pressure, diastolic blood pressure, blood sugar, CK-MB, and the troponin test represent the input fields, while the output field pertains to the presence of a heart attack, divided into two categories (negative and positive): negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows detailed information and the minimum and maximum attribute values for the 1,319 cases in the whole database. To confirm the validity of this data, we examined the patient files in the hospital archive and compared them with the data stored in the laboratory system; we also interviewed the patients and specialized doctors. Table 2 shows a sample of 44 cases from the database and the factors that lead to a heart attack.

After collecting this data, we checked whether it contained null values (invalid values) or errors made during data collection. A value is null if it is unknown; null values require special treatment, since they indicate that the target is not a valid data element. When trying to retrieve data that is not present, you can come across the keyword null during processing, and arithmetic operations on a numeric column with one or more null values will yield a null result. An example of null value processing is shown in Figure 2.

The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs received equal attention and to eliminate their dimensionality. Prior to the use of AI models, data normalization has two major advantages: it prevents attributes in larger numeric ranges from overshadowing attributes in smaller numeric ranges, and it avoids numerical problems during the process. After completing the normalization, we split the data set into training and test sets, using 1,060 cases for training and 259 for testing. Modeling was then implemented using the input and output variables.
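The preprocessing described above (min-max scaling to [0, 1] followed by a 1,060 / 259 split) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the file name and the output column name are assumptions, and scikit-learn is assumed to be available.

```python
# Minimal sketch of the described preprocessing: drop null values, scale all
# inputs to [0, 1], then split into 1,060 training and 259 test cases.
# File and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("heart_attack_data.csv").dropna()

X = df.drop(columns=["Result"])   # the eight input fields ("Result" is an assumed name)
y = df["Result"]                  # negative / positive

X_scaled = MinMaxScaler().fit_transform(X)          # scale each input to the [0, 1] range

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, train_size=1060, test_size=259, random_state=0
)
print(X_train.shape, X_test.shape)
```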
This dataset displays where all surveyed businesses across York Region are located. The data is collected through the Region's annual comprehensive employment survey, and each record contains employment and business contact information, with the exception of home-based and farm-based businesses. Home-based businesses are not included because they are distributed throughout residential communities within the Region and are difficult to survey. Employment data for farm-based businesses are collected through the Census of Agriculture conducted by Statistics Canada and are not included in the York Region Employment Survey dataset.

Update Frequency: Not Planned
Date Created: 17/03/2023
Date Modified: 17/03/2023
Metadata Date: 17/03/2023
Citation Contacts: York Region, Long Range Planning

Attribute Definitions
BUSINESSID: Unique key to identify a business.
NAME: The common business name used in everyday transactions.
FULL_ADDRESS: Full street address of the physical address. (This field concatenates the following fields: Street Number, Street Name, Street Type, Street Direction.)
STREET_NUM: Street number of the physical address
STREET_NAME: Street name of the physical address
STREET_TYPE: Street type of the physical address
STREET_DIR: Street direction of the physical address
UNIT_NUM: Unit number of the physical address
COMMUNITY: Community name where the business is physically located
MUNICIPALITY: Municipality where the business is physically located
POST_CODE: Postal code corresponding to the physical street address
EMPLOYEE_RANGE: The numerical range of employees working in a given firm.
PRIM_NAICS, PRIM_NAICS_DESC: The primary 5-digit NAICS code defines the main business activity that occurs at that particular physical business location.
SEC_NAICS, SEC_NAICS_DESC: If there is more than one business activity occurring at a particular business location (that is substantially different from the primary business activity), then a secondary NAICS is assigned.
PRIM_BUS_CLUSTER, SEC_BUS_CLUSTER: A business cluster is a geographic concentration of interconnected businesses and institutions in a common industry that both compete and cooperate. As defined by York Region, this field indicates the business cluster that the business belongs to.
BUS_ACTIVITY_DESC: A comment box with a detailed text description of the business activity.
TRAFFIC_ZONE: Specifies the traffic zone in which the business is located.
MANUFACTURER: Indicates whether or not the business manufactures at the physical business location.
CAN_HEADOFFICE: The business at this location is considered the Canadian head office.
HEADOFFICEPROVSTATE: Indicates the state or province in which the head office is located, if the head office is in Canada (outside of Ontario) or in the United States
HEADOFFICECOUNTRY: Indicates the country in which the head office is located
YR_CURRENTLOC: Indicates the year that the business moved into its current address.
MAIL_FULL_ADDRESS: The mailing address is the address through which a business receives postal service. This may or may not be the same as the physical street address.
MAIL_STREET_NUM, MAIL_STREET_NAME, MAIL_STREET_TYPE, MAIL_STREET_DIR, MAIL_UNIT_NUM, MAIL_COMMUNITY, MAIL_MUNICIPALITY, MAIL_PROVINCE, MAIL_COUNTRY, MAIL_POST_CODE, MAIL_POBOX: Mailing address fields are similar to street address fields and in most cases will be the same as the street address. Some examples where the two addresses might not be the same include multiple-location businesses, home-based businesses, or when a business receives mail through a P.O. Box.
WEBSITE: The general/main business website.
GEN_BUS_EMAIL: The general/main business e-mail address for that location.
PHONE_NO: The general/main phone number for the business location.
PHONE_EXT: The extension (if any) for the general/main business phone number.
LAST_SURVEYED: The date the record was last surveyed
LAST_UPDATED: The date the record was last updated
UPDATEMETHOD: Displays how the business was last updated, based on a predetermined list.
X_COORD, Y_COORD: The x,y coordinates of the surveyed business location

Frequently Asked Questions

How many businesses are included in the 2022 York Region Business Directory?
The 2022 York Region Business Directory contains just over 34,000 business listings. In the past, businesses were surveyed annually, either in person or by telephone, to improve the accuracy of the directory. Due to the COVID-19 pandemic, a survey was not completed in 2020 and 2021. The Region may return to annual surveying in future years; however, the next employment survey will be in 2024. This listing also includes home-based businesses that participated in the 2022 employment survey.

What is a NAICS code?
The North American Industry Classification System (NAICS) is a hierarchical classification system developed in Canada, Mexico and the United States. It was developed to allow for the comparison of business and employment information across a variety of industry categories. The NAICS hierarchy is structured as follows:
Two digits = sector (e.g., 31-33 contain the Manufacturing sectors)
Three digits = subsector (e.g., 336 = Transportation Equipment Manufacturing)
Four digits = industry group (e.g., 3361 = Motor Vehicle Manufacturing)
Five digits = industry (e.g., 33611 = Automobile and Light Duty Motor Vehicle Manufacturing)
For more information on the NAICS coding system, click here.

How do I add or update my business information in the York Region Business Directory?
To add or update your business information, please select one of the following methods:
• Email: Please email businessdirectory@york.ca to request to be added to the Business Directory.
• Online: Go to www.york.ca/employmentsurvey and participate in the employment survey. Note: this will only be active in 2024, when the Region performs its next employment survey.
There is no charge for obtaining a basic listing of your business in the York Region Business Directory.

How up-to-date is the information?
This directory is based on the 2022 York Region Employment Survey, a survey that attempts to gather information from all businesses across York Region. In instances where we were unable to gather information, the most recent data was used. Farm-based businesses have not been included in the survey; home-based businesses that participated in the 2022 survey are included in the dataset. The date that a business listing was last updated is located in the LastUpdate column in the attached spreadsheet.

Are different versions of the York Region Business Directory available?
Yes, the directory is available in two online formats:
• An interactive, map-based directory searchable by company name, street address, municipality and industry sector.
• The entire dataset in downloadable Microsoft Excel format via York Region's Open Data Portal.
This version of the York Region Business Directory 2022 is offered free of charge. The Directory allows for the detailed analysis of business and employment trends, as well as the construction of targeted contact lists. To view the map-based directory and dataset, go to: 2022 Business Directory - Map

Is there any analysis of business and employment trends in York Region?
Yes. The "2022 Employment and Industry Report" contains information on employment trends in York Region and is based on results from the employment survey. Please visit www.york.ca/york-region/plans-reports-and-strategies/employment-and-industry-report to view the report.

What other resources are available for York Region businesses?
York Region offers an export advisory service and a number of other business development programs and seminars for interested individuals. For details, consult the York Region Economic Strategy Branch.

Who do I contact to obtain more information about the Directory?
For more information on the York Region Business Directory, contact the Planning and Economic Development Branch at businessdirectory@york.ca.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The van der Waals volume is a widely used descriptor in modeling physicochemical properties. However, calculating the van der Waals volume (VvdW) from Bondi group contributions is rather time-consuming for a large data set. A new method for calculating the van der Waals volume, based on Bondi radii, has been developed. The method, termed Atomic and Bond Contributions of van der Waals volume (VABC), is very simple and fast. The only information needed for calculating VABC is the atomic contributions and the numbers of atoms, bonds, and rings. The van der Waals volume (Å³/molecule) can then be calculated from the following formula: VvdW = ∑(all atom contributions) − 5.92 NB − 14.7 RA − 3.8 RNR, where NB is the number of bonds, RA is the number of aromatic rings, and RNR is the number of nonaromatic rings. The number of bonds (NB) can be calculated simply as NB = N − 1 + RA + RNR, where N is the total number of atoms. A simple Excel spreadsheet has been made to calculate van der Waals volumes for a wide range of 677 organic compounds, including 237 drug compounds. The results show that the van der Waals volumes calculated from VABC are equivalent to the computer-calculated van der Waals volumes for organic compounds.
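The VABC formula quoted above translates directly into a short function. Below is a minimal sketch; the tabulated per-atom contribution values are not reproduced here, so their sum must be supplied by the caller, and the example call uses made-up numbers.

```python
# Minimal sketch of the VABC formula (volumes in cubic angstroms per molecule).
# The sum of atomic contributions must be supplied; the per-atom tables are not included.

def vabc_volume(sum_atom_contributions, n_atoms, n_aromatic_rings, n_nonaromatic_rings):
    """Van der Waals volume from Atomic and Bond Contributions (VABC)."""
    n_bonds = n_atoms - 1 + n_aromatic_rings + n_nonaromatic_rings  # NB = N - 1 + RA + RNR
    return (sum_atom_contributions
            - 5.92 * n_bonds
            - 14.7 * n_aromatic_rings
            - 3.8 * n_nonaromatic_rings)

# Illustrative call with made-up numbers (not a real molecule):
print(vabc_volume(sum_atom_contributions=150.0, n_atoms=12,
                  n_aromatic_rings=1, n_nonaromatic_rings=0))
```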
License: CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html
This dataset contains MS Excel spreadsheet code used to analyze an integrative model that illustrates the inherent trade-offs that arise among competing values for landscape space in a boreal forest ecosystem, involving interactions among the main trophic compartments of an intact boreal ecosystem, aka “nature”. The model accounts for carbon accumulation via biomass growth of forest trees (timber), carbon loss due to controls from moose herbivory that vary with moose population density (hunting), and soil carbon inputs and release, which together determine net ecosystem productivity (NEP), a measure of the carbon sink strength of the ecosystem. We examine how controls on carbon dynamics are altered by forest management for timber harvest and by moose hunting. We link the ecological dynamics with an economic analysis by assigning a price to carbon stored within the intact boreal forest ecosystem. We then weigh these carbon impacts against the economic benefits of timber production and hunting across a range of moose population densities. Combined, this carbon-bioeconomic program calculates the total ecosystem benefit of a modelled boreal forest system, providing a framework for examining how different forest harvest levels and moose densities influence the achievement of carbon storage targets under different levels of carbon pricing.

Methods: The Excel spreadsheet converts the analytical model into code to numerically calculate carbon benefits. Data in the article figures are generated using the spreadsheet code.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While inspecting the brain magnetic resonance imaging (MRI) scans from a sample of Multiple Sclerosis (MS) patients, blind to any clinical, cognitive and demographic information, we noticed in a reasonable number of scans the presence of ovoidal or circular, partially stellate regions with signal intensities similar to that of normal brain parenchyma in Fluid Attenuated Inversion Recovery (FLAIR), surrounded by hyperintensities in the periventricular region and seemingly corresponding in all cases to hypointense regions (i.e. with the same signal level as the cerebrospinal fluid) on T1-weighted images. The ovoidal shape of these features, clearly distinctive due to their homogeneously lower signal with respect to their surroundings in the FLAIR sequence, prompted us to refer to them as FLAIR 'pseudocavities'. The idea that they could be differentially distinctive and indicative of an underlying process of different aetiology from their surroundings is not implausible. Inversion recovery imaging can potentially discriminate among tissues based on subtle differences in T1 characteristics. Specifically, the FLAIR sequence exploits the fact that many types of pathology have elevated T1 and T2 values resulting from increased free water content compared to background tissue. The higher specific absorption rate due to the additional 180-degree pulse, together with the increased dynamic range and the additive T1 and T2 contrast, makes FLAIR highly susceptible to differentially reflect subtle pathological processes (Bydder & Young, 1985). We therefore systematically reviewed the literature from March 1999 up to March 2019 to investigate the definitions of MS lesions used to date and their characterisation, and to establish whether what we call FLAIR 'pseudocavities' had been described previously. This dataset consists of an Excel file (Microsoft Excel 97-2003, .xls) with multiple worksheets containing all the references found in the two databases explored (i.e. Medline and EMBASE), as well as the data extracted and the results of the analyses. Briefly, of just over a hundred studies that defined MRI lesions in MS, more than half characterised lesions on the criterion that they were hyperintense on T2-weighted, FLAIR and PD-weighted series, and more than a quarter of the studies characterised lesions on the criteria that they were hyperintense on T2-weighted, FLAIR and PD-weighted series and hypointense on T1-weighted series. The literature review confirmed that what we refer to as FLAIR 'pseudocavities' has not yet been acknowledged in the MS literature. Note: The dataset contains a master Excel spreadsheet with multiple worksheets. The data from each worksheet in the Excel file is also provided as a .csv file.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The raw data are separated by subject as well as by measurement type and measurement number. The caption for the Supporting Information files is “Raw data”. (ZIP)
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
HelpSteer is an open-source dataset designed to support AI alignment through fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and a response along with five human-annotated attributes scored between 0 and 4, with higher scores indicating better quality. By combining machine learning and natural language processing methods with expert annotation, HelpSteer aims to provide a standardized set of values for measuring alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to help organizations build more reliable AI models, yielding more accurate results and an improved user experience.
How to Use HelpSteer: An Open-Source AI Alignment Dataset
HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.
Step 1 - Choosing the Data File
HelpSteer contains two data files, one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading train.csv and validation.csv from the Kaggle page linked above, or get them from the Google Drive repository attached here: [link]. Each row in each file has 7 columns describing a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; the five attribute columns take values between 0 and 4, where higher means better in the respective category.
Step 2 - Exploratory Data Analysis (EDA)
Once you have your file loaded into your workspace or favorite software environment (e.g. libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further. Run some basic EDA commands to summarize each feature's distribution and note potential trends or points of interest: for example, which traits polarize responses the most, or whether there are outliers that might signal something interesting. Plotting these results often reveals patterns across the dataset that can be used later during the modeling phase, also known as feature engineering.
Step 3 - Data Preprocessing
Your interpretation of the raw data during EDA should produce some hypotheses about which features matter most when estimating attribute scores for unknown responses. Preprocessing steps such as cleaning up missing entries and handling outliers are highly recommended before starting any modelling with this data set. If you are unsure about the allowed domain ranges for specific attributes, refer back to the description on the Kaggle page; having the correct numerical ranges ready makes the modelling workload lighter when building predictive models. Do not rush this stage; otherwise, poor results may follow when aiming for high accuracy at model deployment.
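Below is a minimal sketch of Steps 1-3 in pandas, assuming train.csv follows the column layout described above (prompt, response, and the five 0-4 attribute scores).

```python
# Minimal sketch of loading, basic EDA, and simple preprocessing for HelpSteer.
import pandas as pd

df = pd.read_csv("train.csv")
attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

print(df[attributes].describe())   # distribution of each 0-4 score
print(df[attributes].corr())       # which attributes move together

# Simple preprocessing before modelling: drop rows with missing scores and keep
# values inside the documented 0-4 range.
df = df.dropna(subset=attributes)
df[attributes] = df[attributes].clip(0, 4)
```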
Potential use cases for HelpSteer include:
- Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
- Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
- Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers.
If you use this dataset in your research, please credit the original authors.
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...
Applications of the dataset:
Job Analysis: By analyzing the dataset, you can gain insights into the current internship landscape, including the types of internships available, the companies offering them, and the locations where these opportunities exist. This information can help students or job seekers make informed decisions about their internship choices.
Trend Identification: The dataset allows you to identify trends in internships, such as which fields or industries are actively hiring interns, the distribution of internship types across different locations, or how internship salaries vary across different categories or locations. This can be valuable for understanding the dynamics of the job market and identifying emerging trends.
Salary Comparison: With the salary information in the dataset, you can compare the compensation offered by different companies for similar internships. This can help interns evaluate the financial aspects of different opportunities and negotiate better terms. It can also shed light on the salary ranges for internships in different industries or locations.
Company Analysis: By analyzing the companies offering internships, you can gain insights into their hiring practices, internship programs, and industry presence. This information can be useful for students interested in specific companies or industries and can help them tailor their applications accordingly.
Educational Research: Researchers or educators in the field of internships or career development can use the dataset to study internship trends, identify skill requirements for different internships, or analyze the relationship between internship experience and future job prospects.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a dynamic Excel model for prioritizing projects based on Feasibility, Impact, and Size.
It visualizes project data on a Bubble Chart that updates automatically when new projects are added.
Use this tool to make data-driven prioritization decisions by identifying which projects are most feasible and high-impact.
Organizations often struggle to compare multiple initiatives objectively.
This matrix helps teams quickly determine which projects to pursue first by visualizing each project's Feasibility, Impact, and Size on a single chart.
Example (partial data):
| Criteria | Project 1 | Project 2 | Project 3 | Project 4 | Project 5 | Project 6 | Project 7 | Project 8 |
|---|---|---|---|---|---|---|---|---|
| Feasibility | 7 | 9 | 5 | 2 | 7 | 2 | 6 | 8 |
| Impact | 8 | 4 | 4 | 6 | 6 | 7 | 7 | 7 |
| Size | 10 | 2 | 3 | 7 | 4 | 4 | 3 | 1 |
| Quadrant | Description | Action |
|---|---|---|
| High Feasibility / High Impact | Quick wins | Top Priority |
| High Impact / Low Feasibility | Valuable but risky | Plan carefully |
| Low Impact / High Feasibility | Easy but minor value | Optional |
| Low Impact / Low Feasibility | Low return | Defer or drop |
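The quadrant logic in the table above can be expressed as a short classification routine. The sketch below uses part of the example data; the cut-off of 5 for "high" feasibility and impact is an assumption, and the Excel model may draw its quadrant boundaries differently.

```python
# Minimal sketch of the quadrant logic, using part of the example data above.
# The threshold of 5 is an assumption.
projects = {
    "Project 1": {"feasibility": 7, "impact": 8, "size": 10},
    "Project 2": {"feasibility": 9, "impact": 4, "size": 2},
    "Project 4": {"feasibility": 2, "impact": 6, "size": 7},
}

def quadrant(project, threshold=5):
    high_f = project["feasibility"] >= threshold
    high_i = project["impact"] >= threshold
    if high_f and high_i:
        return "Quick wins (Top Priority)"
    if high_i:
        return "Valuable but risky (Plan carefully)"
    if high_f:
        return "Easy but minor value (Optional)"
    return "Low return (Defer or drop)"

for name, data in projects.items():
    print(name, "->", quadrant(data))
```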
The workbook is Project_Priority_Matrix.xlsx. You can use this for:
- Portfolio management
- Product or feature prioritization
- Strategy planning workshops
Project_Priority_Matrix.xlsx is free for personal and organizational use.
Attribution is appreciated if you share or adapt this file.
Author: [Asjad]
Contact: [m.asjad2000@gmail.com]
Compatible With: Microsoft Excel 2019+ / Office 365
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
In this project, I embarked on a journey to refine my Python skills, particularly focusing on web scraping. I was initially inspired by a GitHub project (PythonYouTubeSeries/Scraping a Table from a Website.ipynb at main · AlexTheAnalyst/PythonYouTubeSeries · GitHub) that scrapes data from a Wikipedia page listing the largest companies in the United States by revenue, but I sought to go beyond basic scraping. I realized that to stand out among other job applicants, I needed to take this project a step further. So I decided not only to scrape the data but also to take my CSV file into an Excel workbook, demonstrating my ability to use a range of data analysis tools and apps to collect, clean, and provide comprehensive analysis based on a suitable use case that would display my ability to deliver actionable insights.
I opted for the following case study, which fit the data I had obtained: identifying high-growth sectors and leading companies within those sectors for an investment portfolio. Leveraging the Python libraries BeautifulSoup for HTML parsing and requests for web page retrieval, I extracted the data from the Wikipedia page. Once retrieved, I organized this data into a structured format and proceeded to pinpoint the most promising sectors by performing a cross-industry data analysis. I started by examining the average revenue growth across various sectors. The Petroleum Industry emerged prominently with a high growth rate of 48.89%. The Retail sector, despite a slower growth rate of 7.28%, demonstrated its vast scale, while the Healthcare sector's growth rate of 10.82% highlighted its impressive performance, which can translate to more robust market stability and resilience during economic fluctuations. While Infotech and airlines also showed significant revenue growth, they were not the primary focus of this study. My ILM Business Analytics certification project on Ryanair will provide detailed insights into the airline industry, available soon on my Kaggle.
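A hypothetical sketch of the scraping step is shown below, using requests and BeautifulSoup to pull the Wikipedia table and save it as a CSV. The live page's table layout may differ from what this assumes, and the sector grouping hinted at the end uses assumed column names.

```python
# Hypothetical sketch of scraping the Wikipedia table with requests + BeautifulSoup.
# The table layout on the live page may differ from what this assumes.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

table = soup.find("table", class_="wikitable")
headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]

rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(headers):          # skip rows that do not match the header width
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
df.to_csv("largest_us_companies.csv", index=False)

# Cross-industry step, with assumed column names:
# print(df.groupby("Industry")["Revenue growth"].mean().sort_values(ascending=False))
```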
The results of my analysis are detailed below:
Petroleum Industry Analysis:
Total revenue: ExxonMobil $827,360 million; Chevron Corporation $492,504 million; Marathon Petroleum $360,024 million; Phillips 66 $351,404 million.
Revenue growth: PBF Energy 0.718; ConocoPhillips 0.699; Valero Energy 0.58.
Recommendations: Market leaders: ExxonMobil and Chevron Corporation for stability and reliable returns. High-growth opportunities: PBF Energy and ConocoPhillips for higher growth potential.

Retail Sector Analysis:
Total revenue: Walmart $1,222,578 million; Costco $453,908 million; The Home Depot $314,806 million; Best Buy and Publix under $200,000 million.
Revenue growth: Costco 0.158; Publix 0.135; Best Buy 0.106; Lowe's 0.008.
Recommendations: Market leaders: Walmart for stability and consistent performance. High-growth opportunities: Costco and Publix for higher growth potential.

Healthcare and Pharmaceutical Industry Analysis:
Total revenue: UnitedHealth Group $648,324 million; CVS Health $644,934 million; Cardinal Health $362,728 million; Elevance Health $313,190 million.
Revenue growth: Pfizer 0.234; Humana 0.118; Merck & Co. 0.158; Bristol-Myers Squibb 0.005.
Recommendations: Market leaders: UnitedHealth Group and CVS Health for stability and robust returns. High-growth opportunities: Pfizer for substantial growth, with Humana and Merck & Co. also showing strong growth rates.
Conclusion: The Petroleum Industry exhibits substantial growth, making it attractive for future investment opportunities, with ExxonMobil, Chevron, and ConocoPhillips demonstrating strong revenue and impressive growth rates. Similarly, the Retail and Healthcare sectors also present significant investment opportunities, with market leaders providing stability and high-growth companies offering potential for substantial returns.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).
I originally did my study on another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.
Data is grouped by: month, day, rider_type, bike_type, and time_of_day. total_rides is the count of rides in each grouping, which is also the number of original rows that were combined to make the new summarized row; avg_ride_length is the calculated average of all data in each grouping.
Be sure to use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups, as the values in this file are already averages of the summarized groups. Include the total_rides value in your weighted average calculation to weight properly.
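Below is a minimal sketch of that weighted-average caution in pandas (the original study used R; the logic is the same). The file name is an assumption; the column names match the definitions listed below.

```python
# Minimal sketch: weighted vs naive averages of avg_ride_length.
# File name is an assumption; column names follow the definitions in this dataset.
import pandas as pd

df = pd.read_csv("divvy_2022_summary.csv")

# Wrong: a plain mean of avg_ride_length treats every summarized row equally.
naive = df.groupby("rider_type")["avg_ride_length"].mean()

# Right: weight each row's average by the number of rides it summarizes.
df["ride_length_total"] = df["avg_ride_length"] * df["total_rides"]
grouped = df.groupby("rider_type")
weighted = grouped["ride_length_total"].sum() / grouped["total_rides"].sum()

print(pd.DataFrame({"naive": naive, "weighted": weighted}))
```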
date - year, month, and day in date format; includes all days in 2022
day_of_week - actual day of week as character. Set up a new sort order if needed.
rider_type - values are either 'casual', those who pay per ride, or 'member', for riders who have annual memberships.
bike_type - values are 'classic' (non-electric, traditional bikes) or 'electric' (e-bikes).
time_of_day - divides the day into six equal time frames, 4 hours each, starting at 12AM. Each individual ride was placed into one of these time frames using the time it STARTED, even if the ride was long enough to end in a later time frame. This column was added to help summarize the original dataset.
total_rides - count of all individual rides in each grouping (row). This column was added to help summarize the original dataset.
avg_ride_length - the calculated average of all rides in each grouping (row). Look to total_rides to know how many original ride length values were included in this average. This column was added to help summarize the original dataset.
min_ride_length - minimum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.
max_ride_length - maximum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.
Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of_day)) to remove all spaces.
Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:
Deleted station location columns and lat/long as much of this data was already missing.
Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.
Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.
Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.
Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM.
Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.
Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.
Combined everything into one csv file with all months, containing fewer than 9,000 rows (instead of several million).