The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling.

The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The working group helped develop the survey on which an internal report is based. While the report was for internal use, the survey and the resulting data may be generally useful and are being released publicly.

From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to forward it to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.

Larger storage ranges cover vastly different amounts of data, so the implications could be significant depending on whether the true amount is at the lower or higher end of the range. We therefore requested more detail from "Big Data users," the 47 respondents who reported "10 to 100 TB" or "more than 100 TB" of total current data (Q5); all other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used the actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data were considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or the estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked, with the possible responses. The survey was not administered using this PDF; the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
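As a rough illustration of the per-person storage calculation described above, here is a minimal sketch; the function name and the example values are hypothetical and are not taken from the survey data.

```python
# Hypothetical illustration of the per-person storage calculation described above:
# take the high end of the reported storage range and divide it by the number of
# people the response covers (1 for an individual response, G for a group response).

def per_person_storage_tb(range_high_tb: float, group_size: int = 1) -> float:
    """High end of the reported range divided by the number of respondents covered."""
    return range_high_tb / group_size

# An individual reporting the "1 to 10 TB" range counts as 10 TB per person;
# a group of 5 reporting the same range counts as 2 TB per person.
print(per_person_storage_tb(10))                 # 10.0
print(per_person_storage_tb(10, group_size=5))   # 2.0
```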
This dataset was created by Shiva Vashishtha
http://opendatacommons.org/licenses/dbcl/1.0/
The Orders database contains information on the following variables.
• Identifier and date variables: Row ID, Order ID, Customer ID, Product ID, Order Date, Ship Date
• Continuous variables: Sales, Quantity, Discount, Profit, Shipping Cost
• Categorical variables: Ship Mode, Customer Name, Segment, Postal Code, City, State, Country, Region, Market, Category, Subcategory, Product Name, Order Priority
The purpose of this project: 1. To use descriptive statistics to assess sales performance across segments, markets, product categories, and subcategories; 2. To use diagnostic analytics to test the statistical significance of the factors that influence sales; 3. To use predictive analytics (regression) to measure the strength of the relationship between sales and its drivers and to generate a regression formula for predicting sales; 4. To develop a sales forecasting model based on these insights.
Descriptive analytics
Descriptive statistics for sales
Frequency distribution for sales
Around 44,500 transactions have a value of USD 500 or more.
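A minimal pandas sketch of the descriptive statistics and frequency distribution described above; the file name "Orders.xlsx", the bin width, and the upper bound are assumptions about the dataset layout, since the original analysis is illustrated only with charts.

```python
import pandas as pd

# Assumed file and column names ("Orders.xlsx", "Sales"); adjust to the actual layout.
orders = pd.read_excel("Orders.xlsx")

# Descriptive statistics for sales: count, mean, std, min, quartiles, max.
print(orders["Sales"].describe())

# Frequency distribution of sales values in USD 500 bins (upper bound is an assumption),
# plus the number of transactions worth USD 500 or more.
bins = pd.cut(orders["Sales"], bins=range(0, 23_000, 500), right=False)
print(bins.value_counts().sort_index())
print((orders["Sales"] >= 500).sum())
```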
Sales values across markets
We see an increase in sales across all markets throughout 2012-2015.
We have high sales volumes in the USCA and LATAM markets:
• USCA: USD 757,108 in 2015;
• LATAM: USD 706,632 in 2015.
Sales across product categories
Office Supplies was the most sold product category by quantity in 2012-2015, and Technology was the least sold. However, the Technology category yields high sales.
Further analysis of profitable products reveals that phones and copiers demonstrate high sales.
Sales across segments
The data reveals that there are high sales in the Consumer segment across all product categories.
Diagnostic analytics
Two sample T-test
Using a t-test, we can evaluate how sales differ across segments, regions, and product types; the two-sample t-test tells us whether the difference in mean sales between two groups of transactions is statistically significant.
The two-sample t-test of sales numbers across markets indicated statistically significant differences for the USCA and LATAM markets (p-values < 0.05).
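A minimal sketch of such a two-sample t-test, assuming the Orders data is already loaded into a DataFrame named `orders` (as in the earlier sketch) with `Market` and `Sales` columns. The original write-up does not state exactly which groups were compared; comparing USCA against LATAM is one illustrative reading, and Welch's version (no equal-variance assumption) is used here.

```python
from scipy import stats

# Compare sales between two markets; Welch's t-test does not assume equal variances.
usca = orders.loc[orders["Market"] == "USCA", "Sales"]
latam = orders.loc[orders["Market"] == "LATAM", "Sales"]

t_stat, p_value = stats.ttest_ind(usca, latam, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A p-value below 0.05 indicates a statistically significant difference in mean sales.
```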
The two-sample t-test of sales numbers across product categories indicated statistically significant differences for the Office Supplies and Technology categories (p-values < 0.05).
Pearson correlation
Correlating the continuous variables in the dataset shows the relationships between sales, quantity sold, shipping cost, and profit.
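A minimal pandas sketch of that correlation matrix, again assuming the `orders` DataFrame and the column names listed above (exact spellings may differ in the actual file).

```python
# Pearson correlation matrix of the continuous measures of interest.
continuous_cols = ["Sales", "Quantity", "Shipping Cost", "Profit"]
print(orders[continuous_cols].corr(method="pearson"))
```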
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population distribution of Excel, AL across 18 age groups. It lists the population in each age group along with each group's share of the total population of Excel. The dataset can be used to understand the population distribution of Excel by age. For example, using this dataset, we can identify the largest age group in Excel.
Key observations
The largest age group in Excel, AL was 45 to 49 years, with a population of 74 (15.64%), according to the ACS 2018-2022 5-Year Estimates. At the same time, the smallest age group in Excel, AL was 85 years and over, with a population of 2 (0.42%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Excel Population by Age. You can refer to it here.
https://creativecommons.org/publicdomain/zero/1.0/
Context: This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.
Inspiration: The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.
Dataset Information: Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset:
- Name: The name of the patient associated with the healthcare record.
- Age: The age of the patient at the time of admission, expressed in years.
- Gender: The gender of the patient, either "Male" or "Female."
- Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.).
- Medical Condition: The primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more.
- Date of Admission: The date on which the patient was admitted to the healthcare facility.
- Doctor: The name of the doctor responsible for the patient's care during their admission.
- Hospital: The healthcare facility or hospital where the patient was admitted.
- Insurance Provider: The patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare."
- Billing Amount: The amount of money billed for the patient's healthcare services during their admission, expressed as a floating-point number.
- Room Number: The room number where the patient was accommodated during their admission.
- Admission Type: The type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission.
- Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date plus a random number of days within a realistic range.
- Medication: A medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor."
- Test Results: The results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive."
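Since the dataset was generated with Python's Faker library, here is a minimal sketch of how one such record might be produced. The value lists and numeric ranges below are illustrative assumptions, not the exact generator used for this dataset.

```python
import random
from datetime import timedelta
from faker import Faker

fake = Faker()

def make_record():
    # Illustrative generator for a single synthetic admission record.
    admit = fake.date_between(start_date="-2y", end_date="today")
    return {
        "Name": fake.name(),
        "Age": random.randint(18, 85),
        "Gender": random.choice(["Male", "Female"]),
        "Blood Type": random.choice(["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]),
        "Medical Condition": random.choice(["Diabetes", "Hypertension", "Asthma"]),
        "Date of Admission": admit,
        "Doctor": fake.name(),
        "Hospital": fake.company(),
        "Insurance Provider": random.choice(["Aetna", "Blue Cross", "Cigna",
                                             "UnitedHealthcare", "Medicare"]),
        "Billing Amount": round(random.uniform(500, 50_000), 2),
        "Room Number": random.randint(100, 500),
        "Admission Type": random.choice(["Emergency", "Elective", "Urgent"]),
        # Discharge a random number of days after admission, within a realistic range.
        "Discharge Date": admit + timedelta(days=random.randint(1, 30)),
        "Medication": random.choice(["Aspirin", "Ibuprofen", "Penicillin",
                                     "Paracetamol", "Lipitor"]),
        "Test Results": random.choice(["Normal", "Abnormal", "Inconclusive"]),
    }

print(make_record())
```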
Usage Scenarios: This dataset can be utilized for a wide range of purposes, including:
- Developing and testing healthcare predictive models.
- Practicing data cleaning, transformation, and analysis techniques.
- Creating data visualizations to gain insights into healthcare trends.
- Learning and teaching data science and machine learning concepts in a healthcare context.
- Treating it as a multi-class classification problem and predicting Test Results, which contains 3 categories (Normal, Abnormal, and Inconclusive).
Acknowledgments: Image credit: Image by BC Y from Pixabay
The Large Truck Crash Causation Study (LTCCS) is based on a three-year data collection project conducted by the Federal Motor Carrier Safety Administration (FMCSA) and the National Highway Traffic Safety Administration (NHTSA) of the U.S. Department of Transportation (DOT). The LTCCS is the first-ever national study to attempt to determine the critical events and associated factors that contribute to serious large truck crashes, allowing DOT and others to implement effective countermeasures to reduce the occurrence and severity of these crashes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Five files, one of which is a ZIP archive, containing data that support the findings of this study. PDF file "IA screenshots CSU Libraries search config" contains screenshots captured from the Internet Archive's Wayback Machine for all 24 CalState libraries' homepages for years 2017 - 2019. Excel file "CCIHE2018-PublicDataFile" contains Carnegie Classifications data from the Indiana University Center for Postsecondary Research for all of the CalState campuses from 2018. CSV file "2017-2019_RAW" contains the raw data exported from Ex Libris Primo Analytics (OBIEE) for all 24 CalState libraries for calendar years 2017 - 2019. CSV file "clean_data" contains the cleaned data from Primo Analytics which was used for all subsequent analysis such as charting and import into SPSS for statistical testing. ZIP archive file "NonparametricStatisticalTestsFromSPSS" contains 23 SPSS files [.spv format] reporting the results of testing conducted in SPSS. This archive includes things such as normality check, descriptives, and Kruskal-Wallis H-test results.
https://cdla.io/sharing-1-0/
Healthcare administrators constantly face difficult questions about costs, patient volume, and resource allocation. Understanding how long patients stay, when admissions spike, and how treatment costs fluctuate can help hospitals plan staffing, negotiate insurance contracts, and manage operational efficiency. In this project, I analyzed hospital admissions and treatment cost data to identify patterns in patient stays, insurance coverage, and seasonal trends. The goal was to explore how hospitals could use data to better understand operational demand and financial performance. Using Microsoft Excel, I conducted exploratory analysis on hospital admission records that included patient demographics, hospital locations, insurance providers, treatment costs, and length of stay. Through PivotTables, formulas, and visualizations, I transformed raw data into insights that reveal how patient volume, insurance distribution, and treatment costs vary across hospitals and over time.
Dataset Overview The dataset includes information on: • Patient admissions across multiple hospitals • Insurance providers and coverage distribution • Hospital stay durations • Treatment cost per day • Monthly admission trends These variables allowed for analysis of both operational hospital metrics and financial performance indicators.
Analysis Approach To explore the dataset, I used Excel tools to summarize large volumes of hospital data and identify patterns. Techniques used included: • PivotTables to aggregate hospital admissions and insurance provider distribution • Conditional formatting to highlight cost changes across time periods • Bar, pie, and line charts to visualize operational trends • Calculations to measure average length of stay and daily treatment costs These tools allowed me to quickly transform transactional hospital data into meaningful visual insights.
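The analysis itself was done with Excel PivotTables, but the same aggregations can be sketched in pandas. The file name and the column names below (Admission Date, Insurance Provider, Length of Stay, Treatment Cost per Day) are hypothetical stand-ins for the actual fields in the admission records.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual admission records.
df = pd.read_excel("hospital_admissions.xlsx", parse_dates=["Admission Date"])
df["Month"] = df["Admission Date"].dt.to_period("M")

# Admissions per month (seasonal pattern) and insurance provider distribution (coverage mix).
print(df.groupby("Month").size())
print(df["Insurance Provider"].value_counts(normalize=True))

# Average length of stay and average treatment cost per day, by month.
print(df.pivot_table(index="Month",
                     values=["Length of Stay", "Treatment Cost per Day"],
                     aggfunc="mean"))
```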
Key Findings Cost per Day Trends The average daily treatment cost across hospitals was $3,386.40. Costs fluctuated throughout the year, with the highest treatment costs occurring in September. This spike may reflect seasonal demand for medical procedures, insurance billing cycles, or higher treatment complexity. Following this peak, treatment costs declined during October and November, suggesting a potential normalization in hospital utilization or procedure volume.
Average Length of Stay Across all hospitals, the average patient stay was approximately 16 days. Monthly variations were relatively small but still informative: • April recorded the longest stays, averaging about 15.75 days. • September recorded the shortest stays, averaging about 15.22 days. This distribution suggests that patient stay durations remain relatively stable across the year, though certain months may involve more complex cases or slower discharge cycles.
Insurance Provider Distribution Insurance coverage varied significantly across the patient population. • Cigna covered the largest share of patients, accounting for approximately 20.27% of hospital admissions. • Aetna had the lowest patient share, suggesting smaller network presence or limited coverage in the hospitals represented in the dataset. Additionally, patient length of stay varied slightly by insurance provider. Patients covered by Medicare stayed an average of 15.63 days, while Aetna patients averaged 15.45 days. These differences may reflect variations in patient demographics, treatment complexity, or insurance policy structures.
Seasonal Admission Patterns Admissions also followed a seasonal pattern. • August recorded the highest number of hospital admissions, potentially reflecting increased elective procedures or seasonal health conditions. • February had the lowest admission levels, suggesting lower procedural demand or fewer emergency cases. Understanding these seasonal trends could help hospitals better plan staffing levels and manage resource allocation during high-demand periods.
Business Insights The analysis highlights several opportunities for hospital administrators and healthcare planners: • Rising treatment costs may require closer monitoring of hospital billing practices and insurance reimbursement structures. • Seasonal admission trends can help hospitals anticipate demand and allocate staff more effectively. • Insurance provider distribution may influence strategic partnerships between hospitals and insurers. • Monitoring length-of-stay trends can help identify operational inefficiencies or opportunities to improve discharge planning. By leveraging simple analytical tools in Excel, hospital operations teams can uncover valuable insights that support more informed planning and decision-making.
Skills Demonstrated • Data exploration and cleaning in Excel • PivotTable-based analysis • Trend analysis and statistical summaries • Data visualization using charts and dashboards • Healthcare operational data analysis • Translating raw data into actionable insights
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supplementary material for R.M.R. van Oosten (accepted 17-8-2019), 'Reconsidering ceramics and trade using big data: the significance of stoneware distribution in the Low Countries, 1200–1600', Medieval Ceramics. The Journal of the Medieval Pottery Research Group (MPRG).
1) A bibliography of the publications used, ordered alphabetically per town (an 11-page PDF file);
2) The frequency of stoneware in seven archaeological years, 1250, 1300 in the 38 towns in the Low Countries and 3 towns outside the Low Countries (Excel file with one sheet);
3) The data itself, i.e., the various fabrics per assemblage, ordered per town (Excel file with 42 sheets).
Fabric codes follow the Deventer systeem, the Dutch classification system: P. Bitter, S. Ostkamp en N.L. Jaspers, Classificatiesysteem voor (post-)middeleeuws aardewerk en glas = Het Deventer Systeem (sinds 1989), April 2012, 700 pages, p. 4 & 8.
The annual Retail store data CD-ROM is an easy-to-use tool for quickly discovering retail trade patterns and trends. The current product presents results from the 1999 and 2000 Annual Retail Store and Annual Retail Chain surveys. This product contains numerous cross-classified data tables using the North American Industry Classification System (NAICS). The data tables provide access to a wide range of financial variables, such as revenues, expenses, inventory, sales per square foot (chain stores only) and the number of stores. Most data tables contain detailed information on industry (as low as 5-digit NAICS codes), geography (Canada, provinces and territories) and store type (chains, independents, franchises). The electronic product also contains survey metadata, questionnaires, information on industry codes and definitions, and the list of retail chain store respondents.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Appendix B. Supplementary data
All FE data are available in an Excel file.
https://spdx.org/licenses/CC0-1.0.html
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum,'' the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.
Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.
Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.
The surveys administered during the fall 2018–spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Abstract: The aim of this study is to gain insights into the attitudes of the population towards big data practices and the factors influencing them. To this end, a nationwide survey (N = 1,331), representative of the population of Germany, addressed attitudes about selected big data practices exemplified by four scenarios that may have a direct impact on personal lifestyle. The scenarios covered price discrimination in retail, credit scoring, differentiation in health insurance, and differentiation in employment. Attitudes about the scenarios were set in relation to demographic characteristics, personal value orientations, knowledge about computers and the internet, and general attitudes about privacy and data protection. Another focus of the study is the institutional framework of privacy and data protection, because the realization of benefits or risks of big data practices for the population also depends on knowledge about the rights the institutional framework provides to the population and the actual use of those rights. As results, several challenges to the framework posed by big data practices were confirmed, in particular for the elements of informed consent with privacy policies, purpose limitation, and the individuals' rights to request information about the processing of personal data and to have these data corrected or erased.

Technical remarks:

TYPE OF SURVEY AND METHODS
The data set includes responses to a survey conducted by professionally trained interviewers of a social and market research company in the form of computer-aided telephone interviews (CATI) from 2017-02 to 2017-04. The target population was inhabitants of Germany aged 18 years and older, randomly selected using the sampling approaches ADM eASYSAMPLe (based on the Gabler-Häder method) for landline connections and eASYMOBILe for mobile connections. The 1,331 completed questionnaires comprise 44.2 percent mobile and 55.8 percent landline phone respondents. Most questions offered a 5-point (Likert-like) rating scale anchored, for instance, with 'Fully agree' to 'Do not agree at all', or 'Very uncomfortable' to 'Very comfortable'. Responses were weighted to obtain a representation of the entire German population (variable 'gewicht' in the data sets). To this end, standard weighting procedures were applied to reduce differences between the sample and the entire population with regard to known rates of response and non-response depending on household size, age, gender, educational level, and place of residence.

RELATED PUBLICATION AND FURTHER DETAILS
The questionnaire, analysis, and results will be published in the corresponding report (main text in English; the questionnaire in Appendix B in the German of the interviews and in English translation). The report will be available as an open access publication at KIT Scientific Publishing (https://www.ksp.kit.edu/). Reference: Orwat, Carsten; Schankin, Andrea (2018): Attitudes towards big data practices and the institutional framework of privacy and data protection - A population survey, KIT Scientific Report 7753, Karlsruhe: KIT Scientific Publishing.

FILE FORMATS
The data set of responses was saved for the KITopen repository in 2018-11 in the following file formats: comma-separated values (.csv), tab-separated values (.dat), Excel (.xls), Excel 2007 or newer (.xlsx), and SPSS Statistics (.sav). The questionnaire is saved in the following file formats: comma-separated values (.csv), Excel (.xls), Excel 2007 or newer (.xlsx), and Portable Document Format (.pdf).

PROJECT AND FUNDING
The survey is part of the project Assessing Big Data (ABIDA) (from 2015-03 to 2019-02), which receives funding from the Federal Ministry of Education and Research (BMBF), Germany (grant no. 01IS15016A-F). http://www.abida.de
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All CSV files and Excel sheets used for the statistical analyses are stored here. The Excel file 'Raw data' contains a description of each variable in the tab 'metadata'. The second tab contains all the obtained raw data.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
We have used the Analytic Hierarchy Process (AHP) to derive the priorities of all the factors in the evaluation framework for open government data (OGD) portals. The results of the AHP process are shown in the uploaded PDF file. We collected 2,635 open government datasets of 15 different subject categories (local statistics, health, education, cultural activity, transportation, map, public safety, policies and legislation, weather, environment quality, registration, credit records, international trade, budget and spend, and government bid) from 9 OGD portals in China (Beijing, Zhejiang, Shanghai, Guangdong, Guizhou, Sichuan, Xinjiang, Hong Kong and Taiwan). These datasets were used for the evaluation of these portals in our study. The records of the quality and open access of these datasets can be found in the uploaded Excel file.
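A minimal sketch of deriving AHP priorities with the principal-eigenvector method; the 3x3 pairwise comparison matrix below is illustrative and is not one of the actual judgment matrices from the study.

```python
import numpy as np

# Illustrative pairwise comparison matrix for three evaluation factors
# (A[i, j] = how many times more important factor i is than factor j).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

# Principal eigenvector of A, normalized to sum to 1, gives the priority weights.
eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmax(eigenvalues.real)
weights = eigenvectors[:, k].real
weights = weights / weights.sum()
print("priorities:", np.round(weights, 3))

# Consistency index CI = (lambda_max - n) / (n - 1); compared against a random index
# in a full AHP workflow to check that the judgments are acceptably consistent.
n = A.shape[0]
ci = (eigenvalues.real[k] - n) / (n - 1)
print("consistency index:", round(ci, 4))
```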
This Excel file collates throughput specifications of transmission electron microscopy (TEM) cameras, mass storage, network and memory between 1996 and 2018. The absolute and relative development of the throughput is analyzed in charts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The main aim of load research is to study load characteristics through a versatile dashboard that provides overall visualizations for large data sets collected from different regions and different types of consumers. Several tasks in the energy supply industry (ESI), such as generation planning, power dispatch, system operation and control, and load shedding, require load information and visualizations for both historical and real-time data sets. To support users in carrying out such studies, a versatile load research tool with an energy dashboard was developed on the Microsoft Excel platform. To stimulate insights into big data aspects, historical data over a few years is considered. Wide-ranging features and possibilities for simulating various operating conditions and customizations have been incorporated into this tool. The tool is made available in the Mendeley data repository, which is a public domain for wider use. We hope that users and ESI stakeholders find the tool useful, and we welcome feedback and comments.