MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Insights from City Supply and Demand Data
This data project has been used as a take-home assignment in the recruitment process for data science positions at Uber.
Assignment
Using the provided dataset, answer the following questions:
Data Description
To answer the questions, use the dataset from the file dataset_1.csv. For example, consider row 11 from this dataset:
Date | Time (Local) | Eyeballs | Zeroes | Completed Trips | Requests | Unique Drivers
2012-09-10 | 16 | 11 | 2 | 3 | 4 | 6
This means that during the hour beginning at 4pm (hour 16) on September 10th, 2012, 11 people opened the Uber app (Eyeballs). 2 of them did not see any car (Zeroes) and 4 of them requested a car (Requests). Of the 4 requests, only 3 completed trips actually resulted (Completed Trips). During this time, a total of 6 drivers were logged in (Unique Drivers).
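Since the assignment questions themselves are not reproduced here, the following is only a minimal pandas sketch of how the file could be loaded and a couple of illustrative hourly summaries computed; the file name and column headers are taken from the description above, and the derived metrics are illustrative rather than part of the assignment.

```python
import pandas as pd

# Load the assignment data; column names are assumed to match the table above.
df = pd.read_csv("dataset_1.csv")

# Share of requests that resulted in completed trips, per hour of day.
hourly = df.groupby("Time (Local)")[["Requests", "Completed Trips"]].sum()
hourly["completion_rate"] = hourly["Completed Trips"] / hourly["Requests"]

# Hour of day with the most unique drivers logged in, on average.
busiest_hour = df.groupby("Time (Local)")["Unique Drivers"].mean().idxmax()

print(hourly.sort_values("completion_rate").head())
print("Hour with most drivers online on average:", busiest_hour)
```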
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
General description
This dataset contains some markers of Open Science in the publications of the Chemical Biology Consortium Sweden (CBCS) between 2010 and July 2023. The sample of CBCS publications during this period consists of 188 articles. Every publication was visited manually at its DOI URL to answer the following questions.
1. Is the research article an Open Access publication?
2. Does the research article have a Creative Commons license or a similar license?
3. Does the research article contain a data availability statement?
4. Did the authors submit data of their study to a repository such as EMBL, GenBank, Protein Data Bank (PDB), Cambridge Crystallographic Data Centre (CCDC), Dryad or a similar repository?
5. Does the research article contain supplementary data?
6. Do the supplementary data have a persistent identifier that makes them citable as a defined research output?
Variables
The data were compiled in a Microsoft Excel 365 document that includes the following variables.
1. DOI URL of research article
2. Year of publication
3. Research article published with Open Access
4. License for research article
5. Data availability statement in article
6. Supplementary data added to article
7. Persistent identifier for supplementary data
8. Authors submitted data to NCBI or EMBL or PDB or Dryad or CCDC
Visualization
Parts of the data were visualized in two figures as bar diagrams using Microsoft Excel 365. The first figure displays the number of publications per year, the number of publications published with open access, and the number of publications that contain a data availability statement (Figure 1). The second figure shows the number of publications per year and how many publications contain supplementary data; it also shows how many of the supplementary datasets have a persistent identifier (Figure 2).
File formats and software
The file formats used in this dataset are: .csv (text file), .docx (Microsoft Word 365 file), .jpg (JPEG image file), .pdf/A (Portable Document Format for archiving), .png (Portable Network Graphics image file), .pptx (Microsoft PowerPoint 365 file), .txt (text file), .xlsx (Microsoft Excel 365 file). All files can be opened with Microsoft Office 365 and likely also work with the older versions Office 2019 and 2016.
MD5 checksums
Here is a list of all files of this dataset and their MD5 checksums.
1. Readme.txt (MD5: 795f171be340c13d78ba8608dafb3e76)
2. Manifest.txt (MD5: 46787888019a87bb9d897effdf719b71)
3. Materials_and_methods.docx (MD5: 0eedaebf5c88982896bd1e0fe57849c2)
4. Materials_and_methods.pdf (MD5: d314bf2bdff866f827741d7a746f063b)
5. Materials_and_methods.txt (MD5: 26e7319de89285fc5c1a503d0b01d08a)
6. CBCS_publications_until_date_2023_07_05.xlsx (MD5: 532fec0bd177844ac0410b98de13ca7c)
7. CBCS_publications_until_date_2023_07_05.csv (MD5: 2580410623f79959c488fdfefe8b4c7b)
8. Data_from_CBCS_publications_until_date_2023_07_05_obtained_by_manual_collection.xlsx (MD5: 9c67dd84a6b56a45e1f50a28419930e5)
9. Data_from_CBCS_publications_until_date_2023_07_05_obtained_by_manual_collection.csv (MD5: fb3ac69476bfc57a8adc734b4d48ea2b)
10. Aggregated_data_from_CBCS_publications_until_2023_07_05.xlsx (MD5: 6b6cbf3b9617fa8960ff15834869f793)
11. Aggregated_data_from_CBCS_publications_until_2023_07_05.csv (MD5: b2b8dd36ba86629ed455ae5ad2489d6e)
12. Figure_1_CBCS_publications_until_2023_07_05_Open_Access_and_data_availablitiy_statement.xlsx (MD5: 9c0422cf1bbd63ac0709324cb128410e)
13. Figure_1.pptx (MD5: 55a1d12b2a9a81dca4bb7f333002f7fe)
14. Image_of_figure_1.jpg (MD5: 5179f69297fbbf2eaaf7b641784617d7)
15. Image_of_figure_1.png (MD5: 8ec94efc07417d69115200529b359698)
16. Figure_2_CBCS_publications_until_2023_07_05_supplementary_data_and_PID_for_supplementary_data.xlsx (MD5: f5f0d6e4218e390169c7409870227a0a)
17. Figure_2.pptx (MD5: 0fd4c622dc0474549df88cf37d0e9d72)
18. Image_of_figure_2.jpg (MD5: c6c68b63b7320597b239316a1c15e00d)
19. Image_of_figure_2.png (MD5: 24413cc7d292f468bec0ac60cbaa7809)
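A short sketch of how the listed MD5 checksums could be verified after downloading the files; only the first two entries are shown in the dictionary, and the files are assumed to sit in the working directory.

```python
import hashlib
from pathlib import Path

# Expected MD5 values, copied from the list above (extend with the remaining files).
expected = {
    "Readme.txt": "795f171be340c13d78ba8608dafb3e76",
    "Manifest.txt": "46787888019a87bb9d897effdf719b71",
}

for name, md5_expected in expected.items():
    md5_actual = hashlib.md5(Path(name).read_bytes()).hexdigest()
    status = "OK" if md5_actual == md5_expected else "MISMATCH"
    print(f"{name}: {status}")
```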
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In order to study the effects and outcomes of intergroup dialogue (IGD) on Colgate students, qualitative and quantitative methods were used to collect data. Interviews with 16 Colgate students (two were graduates who currently work at Colgate) were conducted, along with data from a post-test survey that was administered to students enrolled in the two Intergroup Dialogue courses. Scales were created using questions and data from the post-test survey to analyze and examine the means of students' answers and the Cronbach's Alpha scores. This project also used a comparative study to examine how levels of exposure to intergroup dialogue pedagogy affected students. For the comparative study, a sample of students was collected from 2 non-IGD diversity courses, 2 partial IGD courses and 2 full IGD courses. Five students were interviewed for the non-IGD courses as well as the full IGD courses, and six for the courses that used partial IGD. In regard to gender, 13 females and 3 males were interviewed. These methods were used to answer the following questions: 1) What effect does participation in intergroup dialogue have on students attending a liberal arts college? 2) What is the process through which exposure to intergroup dialogue pedagogy leads to these changes in student outcomes? This study solely collected data from Colgate University. This project covers Colgate's racial climate, past literature, past applicable theories as well as the creation of a new theory, data and methods, quantitative and qualitative results, as well as a discussion and conclusion section.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Open Science in (Higher) Education – data of the February 2017 survey
This data set contains:
Full raw (anonymised) data set (completed responses) of Open Science in (Higher) Education February 2017 survey. Data are in xlsx and sav format.
Survey questionnaires with variables and settings (German original and English translation) in pdf. The English questionnaire was not used in the February 2017 survey; it serves only as a translation.
Readme file (txt)
Survey structure
The survey includes 24 questions and its structure can be separated into five major themes: material used in courses (5), OER awareness, usage and development (6), collaborative tools used in courses (2), assessment and participation options (5), and demographics (4). The last two questions are an open text question about general issues on the topics and singular open education experiences, and a request to forward the respondent's e-mail address for further questioning. The online survey was created with Limesurvey[1]. Several questions include filters, i.e. these questions were only shown if a participant chose a specific answer beforehand ([n/a] in the Excel file, [.] in SPSS).
Demographic questions
Demographic questions asked about the current position, the discipline, birth year and gender. The classification of research disciplines was adapted to general disciplines at German higher education institutions. As we wanted to have a broad classification, we summarised several disciplines and came up with the following list, including the option "other" for respondents who do not feel confident with the proposed classification:
Natural Sciences
Arts and Humanities or Social Sciences
Economics
Law
Medicine
Computer Sciences, Engineering, Technics
Other
The current job position classification was also chosen according to common positions in Germany, including positions with a teaching responsibility at higher education institutions. Here, we also included the option "other" for respondents who do not feel confident with the proposed classification:
Professor
Special education teacher
Academic/scientific assistant or research fellow (research and teaching)
Academic staff (teaching)
Student assistant
Other
We chose a free-text (numerical) field for asking about a respondent's year of birth because we did not want to pre-classify respondents' age intervals. This leaves us the option of running different analyses on the answers and of examining possible correlations with respondents' age. Asking about the country was left out as the survey was designed for academics in Germany.
Remark on OER question
Data from earlier surveys revealed that academics are often confused about the proper definition of OER[2]. Some seem to understand OER as free resources, or only refer to open source software (Allen & Seaman, 2016, p. 11). Allen and Seaman (2016) decided to give a broad explanation of OER, avoiding details so as not to tempt participants to claim awareness. Thus, there is a danger of introducing a bias when giving an explanation. We decided not to give an explanation, but to keep this question simple. We assume that someone either knows about OER or not. If they had not heard of the term before, they probably do not use OER (at least not consciously) or create them.
Data collection
The target group of the survey was academics at German institutions of higher education, mainly universities and universities of applied sciences. To reach them, we sent the survey to diverse internal and external institutional mailing lists and via personal contacts. Included lists were discipline-based lists, lists from higher education and higher education didactics communities, as well as lists from open science and OER communities. Additionally, personal e-mails were sent to presidents and contact persons from those communities, and Twitter was used to spread the survey.
The survey was online from February 6th to March 3rd, 2017; e-mails were mainly sent at the beginning and around mid-term.
Data clearance
We received 360 responses, of which Limesurvey counted 208 as complete and 152 as incomplete. Two responses were marked as incomplete but, after checking, turned out to be complete, and we added them to the complete responses dataset. Thus, this data set includes 210 complete responses. Of the 150 incomplete responses, 58 respondents did not answer the first question and 40 discontinued after the first question. The data show a steady decline in responses; we did not detect any particular survey question with a high dropout rate. Incomplete responses were deleted and are not included in this data set.
Due to data privacy reasons, we deleted seven variables automatically assigned by Limesurvey: submitdate, lastpage, startlanguage, startdate, datestamp, ipaddr, refurl. We also deleted answers to question No 24 (email address).
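As an illustration of this clearance step, here is a minimal pandas sketch of how the Limesurvey system variables could be dropped from a tabular export; the file name and the column label used for question 24 are assumptions, not the actual export names.

```python
import pandas as pd

# Hypothetical export file name; adjust to the actual xlsx/sav export.
df = pd.read_excel("open_science_survey_2017.xlsx")

# Limesurvey system variables listed above, removed for data privacy reasons.
system_vars = ["submitdate", "lastpage", "startlanguage",
               "startdate", "datestamp", "ipaddr", "refurl"]
df = df.drop(columns=[c for c in system_vars if c in df.columns])

# Hypothetical label for question no. 24 (e-mail address).
df = df.drop(columns=["q24_email"], errors="ignore")

df.to_excel("open_science_survey_2017_anonymised.xlsx", index=False)
```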
References
Allen, E., & Seaman, J. (2016). Opening the Textbook: Educational Resources in U.S. Higher Education, 2015-16.
First results of the survey are presented in the poster:
Heck, Tamara, Blümel, Ina, Heller, Lambert, Mazarakis, Athanasios, Peters, Isabella, Scherp, Ansgar, & Weisel, Luzian. (2017). Survey: Open Science in Higher Education. Zenodo. http://doi.org/10.5281/zenodo.400561
Contact:
Open Science in (Higher) Education working group, see http://www.leibniz-science20.de/forschung/projekte/laufende-projekte/open-science-in-higher-education/.
[1] https://www.limesurvey.org
[2] The survey question about the awareness of OER gave a broad explanation, avoiding details to not tempt the participant to claim "aware".
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Name: Measuring the flexibility achieved by a change of tariff.
Summary: This dataset contains the results of a survey carried out by the Spanish electricity retailer GoiEner to assess the impact of a change from a "flat rate" tariff to a "time of use" tariff. Two files are provided: (1) the results of the survey merged with a summary of the energy consumption of the clients before and after the change of tariff, and (2) the questions of the survey.
License: CC-BY-SA
Acknowledgement: These data have been collected in the framework of the WHY project. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 891943.
Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.
Collection Date:
Publication Date: December 1, 2022.
DOI: 10.5281/zenodo.7382924
Other repositories: None.
Author: GoiEner, University of Deusto.
Objective of collection: The objective of the data collected is to assess the impact that the change of tariff has on the clients of GoiEner. In particular, the following questions were to be answered: How much energy conservation can a change of tariff trigger? How much time flexibility could be triggered with a time-of-use tariff? What are the main barriers to changing behaviour? Are there any differences in behaviour depending on the socio-cultural-psychological profile of the consumers?
Description: The meaning of each column is described next:
Qx_y: Answers to the survey. See the survey file attached (in Spanish) for details.
X.z: Answers to the question "¿En qué rango de horas se realizan las siguientes acciones en el domicilio?" ("In what range of hours are the following actions carried out in the home?"). Sorted from left to right, top to bottom.
Idioma: Language used to answer the survey.
Desea.dar.su.CU: Data used to de-anonymize the answers.
{P,F,V}{19,20,21}: Total energy consumed during the peak, flat and valley periods of the day between 6/19 and 5/20 (19), 6/20 and 5/21 (20), and 6/21 and 5/22 (21).
T{19,20,21}: Total energy consumed between 6/19 and 5/20 (19), 6/20 and 5/21 (20), and 6/21 and 5/22 (21).
kpi2_abs: T20 - T21
kpi2_rel: kpi2_abs / T20
kpi1_{P,F,V}{19,20,21}: {P,F,V}{19,20,21} / T{19,20,21}
kpi1_{P,F,V}diff: kpi1_{P,F,V}19 - kpi1_{P,F,V}21
T20DHS{19,20,21}: Cost of the energy during the different periods using the last tariff.
T20TD{19,20,21}: Cost of the energy during the different periods using the new tariff.
POWER_TARGET: Energy poverty risk indicator (https://powerpoor.eu/sites/default/files/2022-09/POWERPOOR%20D2.2%20POWER%20TARGET%20v1.0.pdf)
TotalEnergyBudget: Self-assessment of the energy budget of the household in euros.
Invoices20{20,21,22}: Amount of all the invoices for the household during the different periods.
min30{in,pre,pst}: Cluster assigned depending on the household's electric behaviour pre-COVID, during the COVID lockdowns and post-COVID lockdowns. See 10.5281/zenodo.7382818 for details.
Preprocessing steps: Data integration (from different sources from GoiEner services); data transformation (anonymization, unit conversion, metadata generation).
Reuse: This dataset is related to the dataset "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed.
Update policy: There might be a single update in mid-2023 with a repetition of the survey, as there has been another change of tariff in Spain.
Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS by random 64-character hashes.
Technical aspects: The survey is provided as a PDF file and the data as a CSV file compressed with zstandard.
Other: None.
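A minimal sketch of how the KPI columns defined above could be recomputed from the consumption columns; the file name is an assumption, and reading the zstandard-compressed CSV directly requires a pandas version with zstd support (otherwise decompress the file first).

```python
import pandas as pd

# Hypothetical file name; pandas 1.4+ with the "zstandard" package installed
# can read the .zst-compressed CSV directly.
df = pd.read_csv("survey_with_consumption.csv.zst", compression="zstd")

# kpi2: change in total consumption between the 6/20-5/21 and 6/21-5/22 periods.
df["kpi2_abs"] = df["T20"] - df["T21"]
df["kpi2_rel"] = df["kpi2_abs"] / df["T20"]

# kpi1: share of consumption in the peak (P), flat (F) and valley (V) periods,
# and its change between the first and last period.
for period in ["P", "F", "V"]:
    for year in ["19", "20", "21"]:
        df[f"kpi1_{period}{year}"] = df[f"{period}{year}"] / df[f"T{year}"]
    df[f"kpi1_{period}diff"] = df[f"kpi1_{period}19"] - df[f"kpi1_{period}21"]
```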
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
!!!WARNING~~~ This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!
Version 8 release notes: Adds 2019 data.
Version 7 release notes: Changes release notes description; does not change data.
Version 6 release notes: Adds 2018 data.
Version 5 release notes: Adds data in the following formats: SPSS, SAS, and Excel. Changes project name to avoid confusing this data for the ones done by NACJD. Adds data for 1991. Fixes bug where the bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label. All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.
Version 4 release notes: Adds data for 2017. Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year); this is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time; different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent. Made the 'population' column, which is the total population in that agency.
Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
Version 2 release notes: Fixes bug where the Philadelphia Police Department had an incorrect FIPS county code.
The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.). The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), making all character values lower case, and reordering columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
Work release programs allow selected prisoners nearing the end of their terms to work in the community, returning to correctional facilities or community residential facilities in nonworking hours. This project was designed as both a randomized and quasi-experimental field study to assess the effectiveness of work release in the Seattle area. It evaluated the impact of work release sentencing on recidivism and on corrections costs by comparing a sample of inmates who participated in work release with a comparable sample of inmates who completed their sentences in prison. The study was designed to answer the following questions: (1) What are the background and offense characteristics of offenders assigned to work release in the Seattle area? (2) What types of services are received by offenders in work release? and (3) How does the community experience of work release participants compare to that of similar offenders discharged directly into the Seattle community without having gone through work release? For each offender, detailed information was collected on measures relating to work release participation and recidivism outcomes. Information was gathered from Department of Corrections institutional files, work release program records, computerized payment information for legal and financial obligations, and statewide criminal history records. For each offender, background and six- and twelve-month reviews were completed. Part 1, Background Data, supplies variables that cover inmate demographics, employment history, drug use, current offense, prior criminal history, and risk/needs items. Part 2, Drug Testing Data, lists the types of drugs tested for, types of drugs for which there were positive results, and sanctions for drug use. Part 3, Offender Status Data, provides information on inmates' supervision status and the types of programs they participated in. Part 4, Prison Data, includes the number of days spent at different institutions and prerelease centers, work assignment, and prison infractions. Part 5, Work Release Data, contains information on the number of days spent at different work release facilities and any time spent in jail or on escape status while in work release. Data in this file also cover contacts and services received during work release, including personal and phone contacts between the work release participant and community corrections officer at the job and other sites, monitoring checks (employment verification, criminal records checks), sessions in outpatient counseling (drug, alcohol, family, other), employment (number of attempted and completed job interviews, primary job classification, length of employment, wages, and reason left), drug testing (date and type of test, type of positives, sanction imposed), infractions during work release and their sanctions, and arrests and their sanctions. Part 6, Community Placement Data, provides variables on the number of days each month that the offender was on the street, in work release, in pretrial detention, or in other custody, while Part 7, Post-Release Data, focuses on the number of days each month that the offender was on the street, in pretrial detention, or in prison or jail after being released from the work release program. Variables in Part 8, Infractions Data, pertain to the number and types of infractions and associated sanctions.
Part 9, Recidivism Data, provides information on each offense after discharge from the program, including the date of the offense, nature of arrest, disposition, and sentence.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Case Study: How Can a Wellness Technology Company Play It Smart?
This is my first case study as a data analyst, using Excel, Tableau, and R. The case study is part of my Google Data Analytics Professional Certification. Some insights may be presented differently, or not covered, compared with a reader's point of view; readers are welcome to provide feedback, which will be appreciated.
Scenario:
The Bellabeat data analysis case study: in this study, you perform the real-world tasks of a junior data analyst. Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat's products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide the company's marketing strategy; you will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat's marketing strategy.
The Case Study Roadmap followed: in order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
Ask:
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.
These questions will guide your analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat's marketing strategy?
You will produce a report with the following deliverables:
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis
Prepare: includes Dataset used, Accessibility and privacy of data, Information about our dataset, Data organization and verification, Data credibility and integrity.
The dataset used for analysis is from Kaggle, which is considered a reliable source. Sršen encourages the use of public data that explores smart device users' daily habits. She points you to a specific data set: Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits. Sršen notes that this data set might have some limitations and encourages us to consider adding other data to help address them while beginning to work with this data. However, this analysis is confined primarily to the present dataset; no additional data have yet been added to address its limitations. I may later collect additional datasets, depending on their availability to an individual analyst, since comparable product datasets may only be available on a subscription basis or may require searching for and obtaining access. Confining the analysis to this dataset alone is therefore a limitation of this work.
Process Phase:
1. Tools used for analysis: Excel, Tableau, RStudio, Kaggle.
2. Cleaning of data: includes removal of duplicate records. By its nature, the data contains repeated Ids and dates, and there are also zero values, which may be inherent in how activity was recorded or due to other reasons not yet known. The analysis was done on the data as available; in a live project, such issues would be discussed with someone familiar with the data.
3. Analysis was done based on available variables.
Analyze Phase:
Id | Avg.VeryActiveDistance | Avg.ModerateActiveDistance | Avg.LightActiveDistance | TotalDistance | Avg.Calories
1927972279 | 0.09580645 | 0.031290323 | 0.050709677 | |
2026352035 | 0.006129032 | 0.011290322 | 3.43612904 | |
3977333714 | 1.614999982 | 2.75099979 | 3.134333344 | |
8053475328 | 8.514838742 | 0.423870965 | 2.533870955 | |
8877689391 | 6.637419362 | 0.337741935 | 6.188709674 | 3420.258065 | 409.5...
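A minimal sketch of how per-user averages like those above could be reproduced, assuming the Kaggle file dailyActivity_merged.csv and its usual column names (which may differ from the labels shown in the table).

```python
import pandas as pd

# Hypothetical file name from the Fitbit Fitness Tracker Data set on Kaggle.
daily = pd.read_csv("dailyActivity_merged.csv")

# Average the distance and calorie columns per user Id.
summary = (
    daily.groupby("Id")[
        ["VeryActiveDistance", "ModeratelyActiveDistance",
         "LightActiveDistance", "TotalDistance", "Calories"]
    ]
    .mean()
    .rename(columns=lambda c: f"Avg.{c}")
)
print(summary.head())
```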
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study examined the impact of technology on social, organizational, and behavioral aspects of policing. The present data represent an officer-level survey of four law enforcement agencies, designed to answer the following questions: (1) How are technologies used in police agencies across ranks and organizational sub-units? (2) How does technology influence organizational and personal aspects of policing, including operations, culture, behavior, and satisfaction? (3) How do organizational and individual aspects of policing concurrently shape the use and effectiveness of technology? (4) How does technology affect crime control efforts and police-community relationships? (5) What organizational practices help to optimize the use of technology, with an emphasis on enhancing effectiveness and legitimacy?
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
HR Data Analytics
This dataset contains information about employees who worked in a company.
The dataset contains the following columns: Satisfactory Level, Number of Projects, Average Monthly Hours, Time Spent at Company, Promotion in Last 5 Years, Department, Salary.
You can download, copy and share this dataset for analysis and prediction of employee behaviour.
Answering the following questions would be worthwhile:
1. Do exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether they leave the company or continue to work).
2. Plot bar charts showing the impact of employee salaries on retention.
3. Plot bar charts showing the correlation between department and employee retention.
4. Build a logistic regression model using the variables that were narrowed down in step 1.
5. Measure the accuracy of the model.
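A minimal sketch for steps 4 and 5, assuming hypothetical file and column names (e.g. satisfaction_level, average_montly_hours, salary, left) that should be adjusted to the actual dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical file name and column labels; adjust to the downloaded dataset.
df = pd.read_csv("HR_data.csv")

# One-hot encode the categorical salary band; keep two numeric predictors.
X = pd.get_dummies(df[["satisfaction_level", "average_montly_hours", "salary"]],
                   columns=["salary"], drop_first=True)
y = df["left"]  # assumed label: 1 = employee left the company, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```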
The goal of this study was to identify factors that influence whether city misdemeanor domestic violence cases in which batterers are arrested by police result in dismissals, acquittals, or convictions in the courts, and how these cases are processed. The researchers sought to examine factors that influence court officials' decision-making in domestic violence cases, as well as factors that influence victim and witness reluctance in bringing batterers to successful adjudication. In Part 1 researchers merged pretrial services data with information from police and prosecutors' reports in the urban area under study to answer the following questions: (1) What is the rate of dismissals, acquittals, and convictions for misdemeanor court cases and what are the conditions of these sentences? (2) What factors in court cases are significantly related to whether the disposition is a dismissal, acquittal, or conviction, and how are these cases processed? In Part 2, judges, prosecutors, and public defenders were asked detailed questions about their level of knowledge about, attitudes toward, and self-reported behaviors regarding the processing of domestic violence cases to find out: (1) What roles do legal and extra-legal factors play in decision-makers' self-reported behaviors and attitudes? (2) How do decision-makers rate victim advocate and batterer treatment programs? (3) How do court professionals view the victim's role in the court process? and (4) To what degree do court professionals report victim-blaming attitudes and experiences? For Part 3 researchers used a stratified random sample to select court cases of misdemeanor domestic violence that would be transcribed and used for a content analysis to examine: (1) Who speaks in court and how? and (2) What is considered relevant by different court players? In Parts 4-103 victim surveys and interviews were administered to learn about battered women's experiences in both their personal lives and the criminal processing system. Researchers sought to answer the following questions: (1) How do victim/witnesses perceive their role in the prosecution of their abusers? (2) What factors inhibit them from pursuing prosecution? (3) What factors might help them pursue prosecution? and (4) How consistent are the victims'/witnesses' demographic and psychological profiles with existing research in this area? Domestic violence victims attending arraignment between January 1 and December 31 of 1997 were asked to complete surveys to identify their concerns about testifying against their partners and to evaluate the effectiveness of the court system in dealing with domestic violence cases (Part 4). The disposition of each case was subsequently determined by a research team member's examination of defendants' case files and/or court computer files. Upon case closure victims who had both completed a survey and indicated a willingness to be interviewed were contacted to participate in an interview (Parts 5-103). Variables in Part 1, Pretrial Services Data, include prior criminal history, current charges, case disposition, sentence, victim testimony, police testimony, victim's demeanor at trial, judge's conduct, type of abuse involved, weapons used, injuries sustained, and type of evidence available for trial. Demographic variables include age, sex, and race of defendants, victims, prosecutors, and judges.
In Part 2, Professional Survey Data, respondents were asked about their tolerance for victims and offenders who appeared in court more than once, actions taken when substance abuse was involved, the importance of injuries in making a decision, attitudes toward battered women, the role of victim advocates and the police, views on restraining orders, and opinion on whether arrest is a deterrent. Demographic variables include age, sex, race, marital status, and years of professional experience. Variables in Part 3, Court Transcript Data, include number and type of charges, pleas, reasons for dismissals, types of evidence submitted by prosecutors and defense, substance abuse by victim and defendant, living arrangements and number of children of victim and defendant, specific type of abuse, injuries sustained, witnesses to injuries, police testimony, verdict, and sentence. Demographic variables include age and sex of defendant and victim and relationship of victim and defendant. In Part 4, Victim Survey Data, victims were asked about their relationship and living arrangements with the defendant, concerns about testifying in court, desired outcomes of case and punishment for defendant, emotional issues related to abuse, health problems, substance abuse, support networks, other violent domestic incidents and injuries, and safety concerns. Part 5 variables measured victims' safety at different stages of the criminal justice process and danger experienced due to further violent incidents, presence of weapons, and threats of homicide or suicide. Parts 6-103 contain the qualitative interview data.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
E-commerce has become a new channel to support business development. Through e-commerce, businesses can reach and establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. E-commerce has also changed the way people shop and consume products and services. Many people are turning to their computers or smart devices to order goods, which can easily be delivered to their homes.
This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.
The data set contains 500K rows and 8 columns. The following is the description of each column:
1. TransactionNo (categorical): a six-digit unique number that defines each transaction. The letter "C" in the code indicates a cancellation.
2. Date (numeric): the date when each transaction was generated.
3. ProductNo (categorical): a five or six-digit unique character used to identify a specific product.
4. Product (categorical): product/item name.
5. Price (numeric): the price of each product per unit in pound sterling (£).
6. Quantity (numeric): the quantity of each product per transaction. Negative values relate to cancelled transactions.
7. CustomerNo (categorical): a five-digit unique number that defines each customer.
8. Country (categorical): name of the country where the customer resides.
There is a small percentage of order cancellations in the data set. Most of these cancellations were due to out-of-stock conditions on some products. In this situation, customers tend to cancel an order because they want all products delivered at once.
Information is a main asset of businesses nowadays. The success of a business in a competitive environment depends on its ability to acquire, store, and utilize information. Data is one of the main sources of information. Therefore, data analysis is an important activity for acquiring new and useful information. Analyze this dataset and try to answer the following questions:
1. How was the sales trend over the months?
2. What are the most frequently purchased products?
3. How many products does the customer purchase in each transaction?
4. What are the most profitable customer segments?
5. Based on your findings, what strategy could you recommend to the business to gain more profit?
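A minimal pandas sketch for questions 1-3, assuming the column names listed above and a hypothetical file name.

```python
import pandas as pd

# Hypothetical file name; adjust to the downloaded export.
sales = pd.read_csv("ecommerce_sales.csv", parse_dates=["Date"])

# Exclude cancellations (TransactionNo starting with "C", negative quantities).
sales = sales[~sales["TransactionNo"].astype(str).str.startswith("C")]

# 1. Monthly sales trend (revenue = Price * Quantity).
sales["Revenue"] = sales["Price"] * sales["Quantity"]
monthly = sales.set_index("Date")["Revenue"].resample("M").sum()

# 2. Most frequently purchased products.
top_products = sales.groupby("Product")["Quantity"].sum().nlargest(10)

# 3. Average number of items per transaction.
per_transaction = sales.groupby("TransactionNo")["Quantity"].sum().mean()

print(monthly, top_products, per_transaction, sep="\n\n")
```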
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.
The unit of observation, or single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
- Word file with variable descriptions
- Rdata file with the data set (for the R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]
Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
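A minimal sketch of how the attached Rdata file could be inspected from Python; the file name, the name of the data frame inside the Rdata file, and the question-number column label are assumptions.

```python
import pyreadr  # third-party package for reading .Rdata files from Python

# Hypothetical file name; adjust to the actual attachment.
result = pyreadr.read_r("set_dataset.Rdata")
df = next(iter(result.values()))  # first (typically only) object in the file

print(df.shape)  # expected: 8,015 rows, one per (j, k, n) triplet

# Average SET score per question number n, pooled over all teacher-course pairs.
# "question_number" is an assumed column label; SET_score_avg is named in the text.
print(df.groupby("question_number")["SET_score_avg"].mean())
```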
https://vocab.nerc.ac.uk/collection/L08/current/UN/
Standardisation of River Classifications: Framework method for calibrating different biological survey results against ecological quality classifications to be developed for the Water Framework Directive.
Problems to be solved: The variety of assessment methods for streams and rivers in Europe provides good opportunities for implementing the Water Framework Directive, but their diversity may also result in serious strategic problems. The number of organism groups that will be used to assess Ecological Status, and the number of methods available for doing so, are so diverse that inter-calibration and standardisation of methods is crucial. Similarly, protocols need to be devised to integrate the information gathered on the different taxonomic groups. The project aims to derive a detailed picture of which methods are best suited for which circumstances as a basis for standardisation. We propose to develop a standard for determining class boundaries of Ecological Status and another for inter-calibrating existing methods.
Scientific objectives and approach: Data will be used to answer the following questions, which form the basis of a conceptual model: 1) How can data resulting from different assessment methods be compared and standardised? 2) Which methods/taxonomic groups are most capable of indicating particular individual stressors? 3) Which method can be used on which scale? 4) Which method is suited for early and late warnings? 5) How are different assessment methods affected by errors? 6) What can be standardised and what should be standardised? For the purposes of this project two 'core stream types' are recognised: small, shallow, upland streams and medium-sized, deeper lowland streams. Besides the evaluation of existing data, a completely new data set is sampled to gain comparable data on macroinvertebrates, phytobenthos, fish and stream morphology taken with a set of different methods from sites representing different stages of degradation. This will be the main source of data for cross-comparisons and the preparation of standards. A number of 'additional stream types' will be investigated in order to extend the range of sites at which field methods and assessment procedures are compared. The participants will be trained in sampling workshops and quality assurance will be implemented through an audit. Using the project database, assessment methods based on benthic macroinvertebrates will be compared and inter-calibrated, particularly in terms of errors, precision, relation to reference conditions and possible class boundaries. The discriminatory power of different organism groups to detect ecological change will be tested through various statistical procedures. Two CEN Workshops will be held during the contracted period. These will result in the formulation of draft standards for circulation, amendment, and agreement by participating countries in CEN. STAR will benefit from clustering with the complementary Framework V Project, FAME. Project FAME will develop European fish assessment protocols using existing data. STAR fish sampling will be based on FAME protocols and STAR field data will be used by FAME to test these new protocols.
Expected impacts: The project will provide a general conceptual understanding of how to use different organism groups for stream assessment. The project findings will be implemented through a decision support system.
Existing methods based on benthic macroinvertebrates will be inter-calibrated to enable a future comparison of river quality classes throughout Europe. Existing assessment methods will be supplemented by an 'error module'. A matrix of possible class boundaries of grades of 'Ecological Status' associated with different methods and stressors will be developed. Committee drafts for the relevant CEN working group and draft standards on stream assessment methods will be produced. Deliverables: Please see: www.eu-star.at/frameset.htm
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC. It includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors, but no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
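An illustrative sketch of that suppression rule (not CDC's actual pipeline), assuming hypothetical column names for the public use file: combinations of demographic values that occur fewer than 5 times are re-coded to the NA answer option rather than removed.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual public use file.
cases = pd.read_csv("covid19_case_surveillance_public_use.csv")

combo_cols = ["sex", "age_group", "race_ethnicity_combined"]

# Size of each demographic combination, aligned back to the rows.
counts = cases.groupby(combo_cols)[combo_cols[0]].transform("size")

# Re-code rare (<5) combinations to the "NA" answer option; rows are kept.
cases.loc[counts < 5, combo_cols] = "NA"
```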
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Professional organizations in STEM (science, technology, engineering, and mathematics) can use demographic data to quantify recruitment and retention (R&R) of underrepresented groups within their memberships. However, variation in the types of demographic data collected can influence the targeting and perceived impacts of R&R efforts, e.g., giving false signals of R&R for some groups. We obtained demographic surveys from 73 U.S.-affiliated STEM organizations, collectively representing 712,000 members and conference-attendees. We found large differences in the demographic categories surveyed (e.g., disability status, sexual orientation) and the available response options. These discrepancies indicate a lack of consensus regarding the demographic groups that should be recognized and, for groups that are omitted from surveys, an inability of organizations to prioritize and evaluate R&R initiatives. Aligning inclusive demographic surveys across organizations will provide baseline data that can be used to target and evaluate R&R initiatives to better serve underrepresented groups throughout STEM.
Methods
We surveyed 164 STEM organizations (73 responses, rate = 44.5%) between December 2020 and July 2021 with the goal of understanding what demographic data each organization collects from its constituents (i.e., members and conference-attendees) and how the data are used. Organizations were sourced from a list of professional societies affiliated with the American Association for the Advancement of Science, AAAS, (n = 156) or from social media (n = 8). The survey was sent to the elected leadership and management firms for each organization, and follow-up reminders were sent after one month. The responding organizations represented a wide range of fields: 31 life science organizations (157,000 constituents), 5 mathematics organizations (93,000 constituents), 16 physical science organizations (207,000 constituents), 7 technology organizations (124,000 constituents), and 14 multi-disciplinary organizations spanning multiple branches of STEM (131,000 constituents). A list of the responding organizations is available in the Supplementary Materials. Based on the AAAS-affiliated recruitment of the organizations and the similar distribution of constituencies across STEM fields, we conclude that the responding organizations are a representative cross-section of the most prominent STEM organizations in the U.S. Each organization was asked about the demographic information they collect from their constituents, the response rates to their surveys, and how the data were used.
Survey description
The following questions are written as presented to the participating organizations.
Question 1: What is the name of your STEM organization?
Question 2: Does your organization collect demographic data from your membership and/or meeting attendees?
Question 3: When was your organization's most recent demographic survey (approximate year)?
Question 4: We would like to know the categories of demographic information collected by your organization. You may answer this question by either uploading a blank copy of your organization's survey (link provided in the online version of this survey) OR by completing a short series of questions.
Question 5: On the most recent demographic survey or questionnaire, what categories of information were collected? (Please select all that apply)
Disability status; Gender identity (e.g., male, female, non-binary); Marital/Family status; Racial and ethnic group; Religion; Sex; Sexual orientation; Veteran status; Other (please provide)
Question 6: For each of the categories selected in Question 5, what options were provided for survey participants to select?
Question 7: Did the most recent demographic survey provide a statement about data privacy and confidentiality? If yes, please provide the statement.
Question 8: Did the most recent demographic survey provide a statement about intended data use? If yes, please provide the statement.
Question 9: Who maintains the demographic data collected by your organization? (e.g., contracted third party, organization executives)
Question 10: How has your organization used members' demographic data in the last five years? Examples: monitoring temporal changes in demographic diversity, publishing diversity data products, planning conferences, contributing to third-party researchers.
Question 11: What is the size of your organization (number of members or number of attendees at recent meetings)?
Question 12: What was the response rate (%) for your organization's most recent demographic survey?
*Organizations were also able to upload a copy of their demographics survey instead of responding to Questions 5-8. If so, the uploaded survey was used (by the study authors) to evaluate Questions 5-8.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Abstract: Data for the study from Bollenbach, L., Niermann, C., Schmitz, J., & Kanning, M. 2023. Social Participation in the city. All descriptives and variables that were used for the calculation of the correlations, t-tests, and multigroup path analyses are included in this data set. The provided information is sourced from the article "Social participation in the city: exploring the moderating effect of walkability on the associations between active mobility, neighborhood perceptions, and social activities in urban adults" by Bollenbach, Niermann, Schmitz & Kanning, published in BMC Public Health, DOI: 10.1186/s12889-023-17366-0. All details are subject to the applicable availabilities and rights. The corresponding references can be found in the published article.
Method: Study design: A cross-sectional online questionnaire was conducted and implemented via the German online platform 'SoSci Survey' [43]. The data were collected between July and December 2020 in the city of Stuttgart, Germany. For clarification, data collection took place during the COVID-19 pandemic, but no serious restrictions (e.g. curfews) were in place in the data collection timeframe. The first page of the questionnaire contained information about the study, its goals, data privacy protection, and participants' rights in this context. To participate in the study, participants had to give informed consent that they were willing to participate and had read and understood the study information. However, this study includes only a part of all collected data (see 'Measures'). The survey was in German.
Method: Recruitment of the study participants: The individuals who participated in this study were recruited via the distribution of 3000 letters in 12 pre-selected residential areas in the city of Stuttgart, Germany. The letters contained information about the study background, a QR code to directly participate in the online questionnaire, and information about the option to participate via a paper-pencil questionnaire. Inclusion criteria were to live in the study area, to be at least 18 years old, and to understand German. The study sample is described in Table 1. The residential areas were pre-selected based on the objective walkability in the respective residential area, resulting in six residential areas with low walkability and six residential areas with high walkability (see Fig. 1). The walkability scores for the classification of the pre-selected areas into low and high walkability were derived via the first version of the ILS-Walkability-Index [35].
Method: Measures: The following measures were derived and used in the data analyses to answer the research questions.
Method: Social participation: To measure social participation, the scale used by Levasseur et al. [17] was adopted, which operationalizes social participation as participants' frequency of monthly engagement in 10 different social activities. The response options to the question "How often are you involved in the following activities?" were rated on a 5-point Likert scale with the following indications: 1 ("never"), 2 ("less than once a month"), 3 ("at least once a month"), 4 ("at least once a week"), and 5 ("almost every day"). After data collection, the response options were converted into frequencies per month per activity ("almost every day" = 20; "at least once a week" = 6; "at least once a month" = 2; "less than once a month" = 1; and "never" = 0, respectively).
Method: Active mobility: To assess individuals’ level of active mobility, the validated ‘Physical Activity, Exercise, and Sport Questionnaire’ was used [44]. The questionnaire assesses various types of physical activities (such as everyday-life activities, e.g., walking/cycling to work or for leisure, household activities), exercises (for the purpose of physical activity itself, e.g., running, hiking), and sports (a more specific sport, often with a competitive character, e.g., soccer, track and field athletics). However, as only walking and bicycling to work, for leisure, and for recreational purposes (active mobility) were of interest for this study, only these measures were utilized. This resulted in a total of five items (1, walking to work; 2, walking to the grocery store; 3, bicycling to work; 4, bicycling for other transportation purposes; and 5, walking for recreation/strolling) that were summed up to the measure of active mobility. The items were assessed in the following manner [44]. After the introductory question “On how many days, and for how long, have you conducted the following activities in the last four weeks?”, participants answered cloze-type questions, for example (see ‘Additional file 1’ for further information): Walking to work (also partial sections): On _ days during the 4 weeks and approximately _ minutes per day. With the information from the first (number of days of the respective activity) and the second (performed minutes per respective activity) response, the active mobility per month per participant (unit: minutes of active mobility per month per participant) was calculated and used in the analyses. This was done by summing up the products of days and minutes for items 1, 2, 3, 4, and 5, respectively.
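As an illustration of the calculation just described (days multiplied by minutes for each of the five items, then summed), consider the following minimal Python sketch; the item values are invented for the example and do not come from the dataset.

# Each tuple: (item, days in the last 4 weeks, minutes per day) -- illustrative values only.
items = [
    ("walking to work", 8, 15),
    ("walking to the grocery store", 12, 10),
    ("bicycling to work", 4, 20),
    ("bicycling for other transportation purposes", 2, 25),
    ("walking for recreation/strolling", 6, 30),
]

# Minutes of active mobility per month = sum of days x minutes over the five items.
active_mobility_minutes = sum(days * minutes for _, days, minutes in items)
print(active_mobility_minutes)  # 8*15 + 12*10 + 4*20 + 2*25 + 6*30 = 550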
Method: Neighborhood perceptions: Participants’ subjective neighborhood perceptions, i.e., their subjectively perceived satisfaction with the neighborhood environment, were measured via selected questions from the validated ‘Neighborhood Environment Walkability Scale - Germany’ (NEWS-G [45, 46]). To be precise, 10 questions from the subcategory ‘I’ (‘satisfaction with the neighborhood environment’) were assessed (see ‘Additional file 2’). The participants answered the questions regarding their satisfaction with different environmental features on a 5-point Likert scale, with answers ranging from 1 (“very unsatisfied”) to 5 (“very satisfied”). The final scale for analyses resulted from the mean of the answers. The scale had acceptable reliability with a Cronbach’s alpha of 0.74. One example of a question and response is as follows (see ‘Additional file 2’ for further information): ‘How satisfied are you with… the possibility to walk in your neighborhood environment?’ Note that ‘subjectively perceived satisfaction with the neighborhood environment’ is abbreviated to ‘neighborhood perceptions’ in the rest of the manuscript to increase readability.

Method: Walkability: First, the walkability measure that was used for the initial pre-selection of the 12 residential areas for participant recruitment was rechecked and updated with an adapted and improved version of the walkability measure that was not available at the time of the initial data collection. We used the Walkability-Index from the Research Institute for Regional and Urban Development (= ‘Institut für Landes- und Stadtentwicklungsforschung’, ILS; ‘ILS-Walkability-Index’) to measure objective walkability. The index was refined in the project ‘AMbit - Active Mobility’ [47] and is based on the basic concept of the original Walkability-Index, which was developed by Dobešová and Křivka [39]. We used new technical possibilities such as precise routing and open data [35]. Generation of the measure was done as follows: the objective walkability for the city of Stuttgart was determined using QGIS (a free and open-source Geographic Information System) to calculate the ILS-Walkability-Index [35]. We calculated the walkability city-wide on a 500 m by 500 m grid and checked in which grid cell the participants live. For each cell of the grid, a score was calculated. The ILS-Walkability-Index consists of four dimensions: the permeability of the pedestrian network (data source: OpenStreetMap, European Digital Elevation Model), the proportion of green spaces (data source: OpenStreetMap), the population density (data source: German Zensus, 2011), and the availability of amenities (data source: OpenStreetMap) within walking distance. The permeability of the pedestrian network shows the area that a person can reach when walking 500 m in any direction along the pedestrian network, starting from the center of each cell. The result is a polygon, the so-called pedestrian shed. It is put in relation to the theoretical maximum size of the pedestrian shed, a circle with a radius of 500 m. The higher the proportion, the more permeable the pedestrian network. An elevation model serves as a correction factor: the more meters of altitude, the smaller the pedestrian shed. The proportion of green space is the proportion of the pedestrian shed that is covered with green space. Population density is derived from the number of residents living within the pedestrian shed. The accessibility of amenities is based on calculations of the distance along the walking network to different amenities such as supermarkets, schools, or restaurants. The closer and more numerous the amenities, the higher the rating. All four dimensions (permeability of the pedestrian network, green space, population density, amenities) are scaled from 0 to 10 and added together to form the ILS-Walkability Score. Because population density correlates with the amenity score (where many people live, there are a greater number of stores), a weight of 0.5 was applied to population density, while a
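For orientation only, the sketch below combines the four dimensions into a single score for one grid cell. The 0.5 weight on population density is stated above; because the sentence describing the remaining weighting is cut off, the weights of 1.0 on the other three dimensions are an assumption made purely for illustration.

def walkability_score(permeability, green_space, population_density, amenities):
    """Combine the four ILS-Walkability-Index dimensions (each pre-scaled to 0-10) for one
    500 m x 500 m grid cell. Only the 0.5 weight on population density is documented above;
    the implicit 1.0 weights on the other dimensions are assumptions for this illustration."""
    return permeability + green_space + 0.5 * population_density + amenities

print(walkability_score(7.0, 5.5, 8.0, 6.0))  # 7.0 + 5.5 + 0.5*8.0 + 6.0 = 22.5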
National, regional
Households
Sample survey data [ssd]
The 2020/21 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of the total communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview (3,132 communes × 15 households = 46,980 households). We use the large module to select the households for the official interview of the VHFPS survey, and the small-module households serve as a reserve for replacement.
After data processing, the final sample size for Round 5 is 3,922 households.
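A minimal Python sketch of the two-stage selection described above (one enumeration area per commune, then 15 households within the selected EA); the data structure, function name, and synthetic frame are hypothetical and only illustrate the logic.

import random

def draw_sample(frame, households_per_ea=15, seed=42):
    """Two-stage draw: one enumeration area (EA) per commune, then a fixed number of
    households within the selected EA. `frame` maps commune -> {ea_id: [household_ids]}."""
    rng = random.Random(seed)
    sample = []
    for commune, eas in frame.items():
        ea_id = rng.choice(sorted(eas))                           # stage 1: one EA per commune
        sample.extend(rng.sample(eas[ea_id], households_per_ea))  # stage 2: households within the EA
    return sample

# Tiny synthetic frame: two communes, two EAs each, 20 households per EA.
frame = {
    f"commune_{c}": {f"ea_{c}_{e}": [f"hh_{c}_{e}_{h}" for h in range(20)] for e in range(2)}
    for c in range(2)
}
print(len(draw_sample(frame)))  # 2 communes x 15 households = 30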
Computer Assisted Telephone Interview [cati]
The questionnaire for this round consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 4. Education
Section 5. Employment (main respondent)
Section 6. Coping
Section 8. FIES
Section 10. Opinion
Note: Some categorical responses have been merged in the anonymized data set for confidentiality.
Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps (see the sketch after this list):
• Append households interviewed in ethnic minority languages to the main dataset of households interviewed in Vietnamese.
• Remove unnecessary variables that were automatically calculated by SurveyCTO.
• Remove household duplicates in the dataset where the same form was submitted more than once.
• Remove observations of households that were not supposed to be interviewed according to the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and to write down the respondent’s answer in detail so that the survey management team could decide which code best suited that answer.
• Correct data based on supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line that allows enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether it needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying below the 5th percentile or above the 95th percentile, by listening to the interview recordings.
• Perform a final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
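The sketch below illustrates two of the steps above (removing duplicate submissions and flagging outliers outside the 5th-95th percentile range) on a tiny synthetic table; the column names and values are hypothetical and do not reflect the survey's actual variables.

import pandas as pd

# Tiny synthetic table standing in for the raw survey data; columns are hypothetical.
df = pd.DataFrame({
    "household_id": [101, 102, 102, 103, 104, 105],
    "income":       [5.0, 7.5, 7.5, 120.0, 6.2, 0.1],
})

# Remove duplicate submissions of the same form (keep the first submission).
df = df.drop_duplicates(subset=["household_id"], keep="first")

# Flag values below the 5th or above the 95th percentile for review against the recordings.
low, high = df["income"].quantile([0.05, 0.95])
df["income_outlier"] = (df["income"] < low) | (df["income"] > high)

print(df)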
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ARabic Dataset for Automatic Short Answer Grading Evaluation V1. ISLRN 529-005-230-448-6.
Our dataset consists of reported evaluations relating to answers submitted for three different exams administered to three classes of students.
The exams were conducted under natural conditions of evaluation.
Each exam consists of 16 short-answer questions (48 questions in total across the three exams).
For each question, a model answer is proposed.
Students submitted answers to these questions.
The number of answers obtained differs from one question to another.
The dataset includes a total of 2133 pairs (Model Answer, student answer).
The dataset encompasses five types of questions:
• "عرف ": Define?
• "إشرح": Explain?
• "ما النتائج المترتبة على": What consequences?
• "علل": Justify?
• "ما الفرق": What is the difference
The AR-ASAG dataset is available in different versions: TXT, XML, XML-MOODLE, and Database (.DB).
The .DB format allows the necessary exports to be made according to specific analysis needs.
The XML-MOODLE format is intended for use on Moodle e-learning platforms.
For each pair, two manual grades (Mark1 and Mark2) are provided, together with a manual average gold score. Both manual grades are available in the dataset.
Inter-annotator agreement: Pearson correlation r = 0.8384; root mean square error RMSE = 0.8381.
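For readers who want to reproduce these agreement figures from the released grades, the following minimal Python sketch computes the Pearson correlation and RMSE between the two graders and derives the average gold score. The numeric values shown are invented, and treating the gold score as the plain average of Mark1 and Mark2 is an assumption based on the description above.

import numpy as np

# Illustrative manual grades for a handful of (model answer, student answer) pairs;
# the real dataset contains 2,133 pairs with grades Mark1 and Mark2.
mark1 = np.array([5.0, 3.5, 4.0, 2.0, 1.0, 4.5])
mark2 = np.array([4.5, 3.0, 4.0, 2.5, 1.5, 5.0])

# Assumed gold score: the average of the two manual grades.
gold = (mark1 + mark2) / 2.0

# Inter-annotator agreement: Pearson correlation and RMSE between the two graders
# (the dataset reports r = 0.8384 and RMSE = 0.8381 over all pairs).
pearson_r = np.corrcoef(mark1, mark2)[0, 1]
rmse = np.sqrt(np.mean((mark1 - mark2) ** 2))
print(f"Pearson r = {pearson_r:.4f}, RMSE = {rmse:.4f}")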
The dataset can also be used for essay scoring, as students' answers can reach 4-5 sentences.
The dataset exists in TXT, XML, and XML-MOODLE versions.
The name of each file is representative of its content.
We use the term "Mark" to mean "Grade".
For privacy reasons, no student identifiers are used in this Dataset.
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Making clinical audit data transparent

In his transparency and open data letter to Cabinet Ministers on 7 July 2011, the Prime Minister made a commitment to make clinical audit data available from the national audits within the National Clinical Audit and Patient Outcomes Programme. Each year, data from the National Diabetes Inpatient Audit will be made available in CSV format. The data are also being made available on the data.gov website.

What information is being made available?
- Audit participation by NHS Trust and data completeness for the key fields.
- Measures about the process of care given to patients.
- Information about care outcomes and treatment.

Trusts and Networks are identified by name and their national code. These data do not list individual patient information, nor do they contain any patient-identifiable data. The National Diabetes Inpatient Audit (NaDIA) is commissioned by the Healthcare Quality Improvement Partnership (HQIP) and delivered by the Health and Social Care Information Centre, working with Diabetes UK. The National Diabetes Inpatient Audit is a snapshot audit of diabetes inpatient care in England and Wales. The audit sets out to answer the following questions:
- Did diabetes management minimise the risk of avoidable complications?
- Did harm result from the inpatient stay?
- Was the patient experience of the inpatient stay favourable?
- Has the quality of care and patient feedback changed since NaDIA 2010, NaDIA 2011 and NaDIA 2012?

This publication was produced in two parts: the comparative hospital-level analysis was published on 5 March 2014 and the National Report was published on 26 June 2014.