This dataset contains the geographic data used to create maps for the San Diego County Regional Equity Indicators Report led by the Office of Equity and Racial Justice (OERJ). The full report can be found here: https://data.sandiegocounty.gov/stories/s/7its-kgpt
Demographic data from the report can be found here: https://data.sandiegocounty.gov/dataset/Equity-Report-Data-Demographics/q9ix-kfws
Filter by the Indicator column to select data for a particular indicator map.
Export notes: Dataset may not automatically open correctly in Excel due to geospatial data. To export the data for geospatial analysis, select Shapefile or GEOJSON as the file type. To view the data in Excel, export as a CSV but do not open the file. Then, open a blank Excel workbook, go to the Data tab, select “From Text/CSV,” and follow the prompts to import the CSV file into Excel. Alternatively, use the exploration options in "View Data" to hide the geographic column prior to exporting the data.
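For users working outside Excel, the exported file can also be loaded programmatically. The sketch below is a minimal example assuming a GeoJSON export saved as equity_indicators.geojson (a hypothetical file name); the Indicator column is described above, but other column names should be checked against the actual export.

```python
# Minimal sketch: load the exported equity indicator geometries outside Excel.
# "equity_indicators.geojson" is a hypothetical file name for the GeoJSON export.
import geopandas as gpd

gdf = gpd.read_file("equity_indicators.geojson")

# Filter by the Indicator column to isolate one indicator map, as described above.
life_expectancy = gdf[gdf["Indicator"] == "Life Expectancy"]

# Drop the geometry column to get a plain table comparable to the CSV-in-Excel view.
table = life_expectancy.drop(columns="geometry")
print(table.head())
```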
USER NOTES: 4/7/2025 - The maps and data have been removed for the Health Professional Shortage Areas indicator due to inconsistencies with the data source leading to some missing health professional shortage areas. We are working to fix this issue, including exploring possible alternative data sources.
5/21/2025 - The following changes were made to the 2023 report data (Equity Report Year = 2023):
- Self-Sufficiency Wage - a typo in the indicator name was fixed (changed "sufficienct" to "sufficient") and the percent for one PUMA was corrected from 56.9 to 59.9 (PUMA = San Diego County (Northwest)--Oceanside City & Camp Pendleton). Notes were made consistent for all rows where geography = ZCTA. A note was added to all rows where geography = PUMA.
- Voter Registration - the label "92054, 92051" was renamed to be in numerical order and is now "92051, 92054". Data was removed from the percentile column because the categories are not true percentiles.
- Employment - data was corrected to show the percent of the labor force that is employed (ages 16 and older). Previously, the data was the percent of the population 16 years and older that is in the labor force.
- 3- and 4-Year-Olds Enrolled in School - percents are now rounded to one decimal place.
- Poverty - the last two categories/percentiles changed because the 80th percentile cutoff was corrected by 0.01 and one ZCTA was reassigned to a different percentile as a result.
- Low Birthweight - the "33th percentile" label was corrected to "33rd percentile".
- Life Expectancy - corrected the category and percentile assignment for SRA CENTRAL SAN DIEGO.
- Parks and Community Spaces - corrected the category assignment for six SRAs.
5/21/2025 - Data was uploaded for Equity Report Year 2025. The following changes were made relative to the 2023 report year. Adverse Childhood Experiences - geographic data was added for the 2025 report; bins and corresponding percentiles were not calculated due to the small number of geographic areas. Low Birthweight - bins and corresponding percentiles were not calculated due to the small number of geographic areas.
Prepared by: Office of Evaluation, Performance, and Analytics and the Office of Equity and Racial Justice, County of San Diego, in collaboration with the San Diego Regional Policy & Innovation Center (https://www.sdrpic.org).
License: Norwegian Licence for Open Government Data (NLOD) 2.0 - http://spdx.org/licenses/NLOD-2.0
The data sets provide an overview of selected data on waterworks registered with the Norwegian Food Safety Authority. The information has been reported by the waterworks through application processing or other reporting to the Norwegian Food Safety Authority. The drinking water regulations require, among other things, annual reporting, and the Norwegian Food Safety Authority has created a separate form service for this purpose. The data sets include public or private waterworks that supply 50 people or more. In addition, all municipally owned businesses with their own water supply are included regardless of size. The data sets also contain decommissioned facilities, for those who wish to view historical data, i.e. data for previous years.
There are data sets for the following supervisory objects: 1. Water supply system (also includes analysis of drinking water). 2. Transport system. 3. Treatment facility. 4. Entry point (also includes analysis of the water source). Below you will find the data set: 1. Water supply system_reporting. In addition, there is a file (information.txt) that provides an overview of when the extracts were produced and how many lines there are in the individual files. The extracts are produced weekly.
Furthermore, for the water supply system, transport system and entry point data sets it is possible to see historical data on what is included in the annual reporting. To make use of that information, the file must be linked to the parent ("moder") file to get names and other static information. These files have the _reporting ending in the file name. Descriptions of the data fields (i.e. metadata) in the individual data sets appear in separate files, available in PDF format.
If you double-click the CSV file and it opens directly in Excel, the Norwegian characters æ, ø and å will not display correctly. To see the character set correctly in Excel: start Excel with a new spreadsheet; select Data and then From Text, and press Import; select Delimited and file origin 65001: Unicode (UTF-8), tick "My data has headers" and press Next; remove Tab as the separator, select Semicolon as the separator and press Next; then complete the import. Alternatively, the complete data sets can be imported into a separate database and compiled as desired. There are link keys in the files that make it possible to link the files together. The waterworks are responsible for the quality of the data sets.
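For users who prefer to work outside Excel, a minimal sketch of reading and linking the extracts is shown below. The file names and the link-key column are hypothetical placeholders; the actual names should be taken from the downloaded files and the metadata PDFs.

```python
# Minimal sketch: read the semicolon-separated, UTF-8 encoded extracts without Excel
# and join a _reporting file to its parent ("moder") file via the link key.
# File names and the "link_key" column are hypothetical placeholders.
import pandas as pd

parent = pd.read_csv("water_supply_system.csv", sep=";", encoding="utf-8")
reporting = pd.read_csv("water_supply_system_reporting.csv", sep=";", encoding="utf-8")

# Join the annual reporting rows to the static information (names, etc.) in the parent file.
merged = reporting.merge(parent, on="link_key", how="left")
print(merged.head())
```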
—
Purpose: Make data for drinking water supply available to the public.
License: Attribution 4.0 International (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the paper: Understanding the Issues, Their Causes and Solutions in Microservices Systems: An Empirical Study. The dataset is recorded in an MS Excel file which contains the following Excel sheets, and the description of each sheet is briefly presented below.
(1) Selected Systems: contains the 15 selected open source microservices systems with the color code and URL of each system.
(2) Raw Data: contains information on the 10,222 initially retrieved issues, including issue titles, issue links, issue open dates, issue closed dates, and the number of participants in each issue discussion.
(3) Screened Issues: contains the issues that meet the initial selection criteria (i.e., 5,115 issues) and the issues that do not meet the initial selection criteria (i.e., 5,107 issues).
(4) Selected Issues (Round 1): contains the list of 5,115 issues that meet the initial selection criteria.
(5) Selected Issues (Round 2): contains the issues related to the RQs (i.e., 2,641 issues) and the issues not related to the RQs (i.e., 2,474 issues).
(6) Selected Issues: contains the list of the selected 2,641 issues, which were used to answer the RQs.
(7) Initial Codes: contains the initial codes for identifying the types of issues, causes, and solutions. We used these codes to further generate the subcategories and categories of issues, causes, and solutions.
(8) Interview Questionnaire: contains the interview questions we asked microservices practitioners to identify any missing issues, causes, and solutions, as well as to improve the proposed taxonomies.
(9) Interview Results: contains the results of the interviews that we conducted to confirm and improve the developed taxonomies of issues, causes, and solutions.
(10) Survey Questionnaire: contains the survey questions we asked microservices practitioners through a web-based survey to validate our taxonomies of issues, causes, and solutions.
(11) Issue Taxonomy: contains the detailed issue taxonomy consisting of 19 categories, 54 subcategories, and 402 types of issues.
(12) Cause Taxonomy: contains the detailed cause taxonomy consisting of 8 categories, 26 subcategories, and 228 types of causes.
(13) Solution Taxonomy: contains the detailed solution taxonomy consisting of 8 categories, 32 subcategories, and 177 types of solutions.
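For working with the workbook programmatically, a minimal sketch is shown below; the Excel file name is a hypothetical placeholder, and the sheet names follow the list above.

```python
# Minimal sketch: load the replication-package workbook and inspect its sheets.
# "microservices_issues_dataset.xlsx" is a hypothetical file name.
import pandas as pd

sheets = pd.read_excel("microservices_issues_dataset.xlsx", sheet_name=None)  # dict of DataFrames
print(list(sheets.keys()))

# Example: peek at the 2,641 issues used to answer the RQs.
selected_issues = sheets["Selected Issues"]
print(selected_issues.head())
```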
This page provides data for the 3rd Grade Reading Level Proficiency performance measure.
The dataset includes student performance results on the English/Language Arts section of the AzMERIT from Fall 2017 and Spring 2018. Data is representative of third-grade students in public elementary schools in Tempe, including schools from both the Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, the total percentage passing, and the percentage of students scoring at each of the four levels of proficiency.
The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.
Additional Information
Source: Arizona Department of Education
Contact: Ann Lynn DiDomenico
Contact E-Mail: Ann_DiDomenico@tempe.gov
Data Source Type: Excel/ CSV
Preparation Method: Filters on original dataset: within "Schools" tab - School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students]. Remove irrelevant fields; add Fiscal Year.
Publish Frequency: Annually as data becomes available
Publish Method: Manual
Data Dictionary
License: Open Database License (ODbL) v1.0 - https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By Health [source]
This dataset presents a comprehensive look into the prevalence of asthma among Californian residents in terms of emergency department visits. Using age-adjusted rates and county FIPS codes, it offers an accurate snapshot of the prevalence rates per 10,000 people and provides key insights into how this condition affects certain age groups by ZIP Code. With its easy-to-use associated map view, this dataset allows users to quickly gain deeper knowledge about this important health issue and craft meaningful solutions to address it.
This dataset contains counts and rates of asthma related emergency department visits by ZIP Code and age group in California. This data can be useful when doing research on asthma related trends or attempting to find correlations between environmental factors, prevalence of disease and geography.
1. Select a year for analysis - the latest year for which data is available is the default selection, but other years are also listed in the dropdown menu.
2. Select an age group to analyze - use the provided dropdown menus to select one or more age groups (all ages, 0-17, 18+) if you wish to compare two age groups in your analysis.
3. Define a geographical area by selecting a ZIP code or county FIPS code from which you wish to obtain your data, based on its availability or importance to your research question.
4. View and download relevant data - after selecting all of the desired criteria (year, age group(s), ZIP code/county FIPS code), click "View Data" and then "Download" at the bottom right corner of the window that opens.
5. Analyze the information found - use software such as Microsoft Excel or open-source programs like OpenOffice Calc to gain insight into your downloaded dataset through statistical calculations, graphs, etc. In particular, look out for anomalies that may warrant further investigation; a short programmatic sketch of this step follows below.
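As a programmatic alternative to step 5, the sketch below loads the CSV named in the file listing further down and ranks ZIP codes by visit rate. The exact age-group and rate column names are assumptions and may differ in the actual file.

```python
# Minimal sketch: summarize asthma ED visit rates from the downloaded extract.
# Column names other than "Year" and "ZIP code" are placeholders and may differ.
import pandas as pd

df = pd.read_csv("Asthma_Emergency_Department_Visit_Rates_by_ZIP_Code.csv")

# Keep the latest year and the all-ages group, then rank ZIP codes by visit rate.
latest = df[df["Year"] == df["Year"].max()]
all_ages = latest[latest["Age Group"] == "All Ages"]          # placeholder column/value
ranked = all_ages.sort_values("Age-adjusted rate", ascending=False)  # placeholder column
print(ranked[["ZIP code", "Age-adjusted rate"]].head(10))
```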
- Identifying the geographic clusters of asthma sufferers by analyzing the rate of emergency department visits with geographic mapping.
- Developing outreach initiatives to areas with a high rate of ED visits for asthma to provide education, interventions and resources designed towards increasing preventive care and reducing preventable complications due to lack of access or knowledge about available services in these communities.
- Assessing disparities in ED visit rates for asthma between age groups, as well as between urban and rural areas or different socio-economic groups within counties or ZIP codes, in order to identify areas that need increased interventions, services, and other resources related to asthma care, and to reduce the burden of this chronic condition among particularly vulnerable population groups.
If you use this dataset in your research, please credit the original authors. Data Source
License: Open Database License (ODbL) v1.0 - You are free to: - Share - copy and redistribute the material in any medium or format. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - Keep intact - all notices that refer to this license, including copyright notices. - No additional restrictions - You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
File: Asthma_Emergency_Department_Visit_Rates_by_ZIP_Code.csv

| Column name | Description |
|:------------|:------------|
| Year | The year the data was collected. (Integer) |
| ZIP code | The ZIP code of the area the data was collected from. (String... |
License: http://dcat-ap.de/def/licenses/other-closed
The data set contains the results of the mayoral election on 25 May 2014 and the mayoral runoff election on 15 June 2014 in the City of Düsseldorf.
The local elections took place on 25 May 2014. Because no candidate reached a clear majority, a runoff election for mayor was held on 15 June 2014.
An authority may set up different territorial levels for presenting election results, from the lowest level (voting districts), through constituencies and city districts, up to the level of the city or municipality, county and constituency. However, not all levels are necessary for each type of election. For each of the territorial levels that an authority has set up, there is a file containing an overview of those areas together with the quick reports (Schnellmeldungen) already received.
Further data sets contain information on the division of electoral areas for local elections and the division of voting districts.
Information on terms in the field of ‘Elections’ can be found in the Election ABC of the interactive learning platform for election workers of the City of Düsseldorf.
The files are encoded in UTF-8. By default, Excel does not display the umlauts in the files correctly. You can avoid this as follows:
Excel 2003: From the ‘Data’ menu, select ‘Import External Data’ and then ‘Import Data’. The ‘Select Data Source’ dialog opens. Select the file you want to open and press the ‘Open’ button. Set the file origin to ‘65001: Unicode (UTF-8)’ and continue with the ‘Next’ button. In the next dialog, set the separator to ‘Semicolon’ instead of ‘Tab’ and continue with the ‘Next’ button again. Then select the ‘Text’ option as the data format of the columns and exit the wizard with the ‘Finish’ button. Use the ‘OK’ button to finish the procedure and the data is displayed UTF-8 encoded in Microsoft Excel.
Excel 2010: On the ‘Data’ tab, in the ‘Get External Data’ section, select the option ‘From Text’. The ‘Import Text File’ dialog opens. Select the file you want to open and press the ‘Open’ button. Set the file origin to ‘65001: Unicode (UTF-8)’ and continue with the ‘Next’ button. In the next dialog, set the separator to ‘Semicolon’ instead of ‘Tab’ and continue with the ‘Next’ button again. Then select the ‘Text’ option as the data format of the columns and exit the wizard with the ‘Finish’ button. Use the ‘OK’ button to finish the procedure and the data is displayed UTF-8 encoded in Microsoft Excel.
The files contain the following column information:
Number: constituency number
Name: name of the constituency
MaxQuickReports: maximum number of quick reports
AnzQuickReports: number of quick reports already recorded
Eligible voters: number of eligible voters
Submitted: number of ballot papers submitted
Turnout: voter turnout at the respective territorial level
Valid ballot papers: number of valid ballot papers
Valid: number of valid votes cast
Invalid ballot papers: number of invalid ballot papers
Invalid: number of invalid votes cast
In addition, the following fields are available for each party (example for a party called ‘A Party’):
A Party: total number of votes for the party
A-Party_Proz: percentage of the party's total votes out of the overall result
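As an alternative to the Excel import steps, a minimal sketch of reading one of the UTF-8, semicolon-separated files is shown below. The file name is a hypothetical placeholder, and the column names are assumed to follow the field descriptions above.

```python
# Minimal sketch: read one of the UTF-8, semicolon-separated results files without Excel.
# "ob_stichwahl_2014_wahlbezirke.csv" is a hypothetical file name.
import pandas as pd

results = pd.read_csv("ob_stichwahl_2014_wahlbezirke.csv", sep=";", encoding="utf-8")

# Example: recompute a party's percentage from its vote count and the valid votes,
# assuming column names as described above ("A Party", "Valid").
results["A Party_check"] = 100 * results["A Party"] / results["Valid"]
print(results.head())
```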
License: https://dataverse-staging.rdmc.unc.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=hdl:1902.29/CD-10849
"The Statistical Abstract of the United States, published since 1878, is the standard summary of statistics on the social, political, and economic organization of the United States. It is designed to serve as a convenient volume for statistical reference and as a guide to other statistical publications and sources. The latter function is served by the introductory text to each section, the source note appearing below each table, and Appendix I, which comprises the Guide to Sources of Statisti cs, the Guide to State Statistical Abstracts, and the Guide to Foreign Statistical Abstracts. The Statistical Abstract sections and tables are compiled into one Adobe PDF named StatAbstract2009.pdf. This PDF is bookmarked by section and by table and can be searched using the Acrobat Search feature. The Statistical Abstract on CD-ROM is best viewed using Adobe Acrobat 5, or any subsequent version of Acrobat or Acrobat Reader. The Statistical Abstract tables and the metropolitan areas tables from Appendix II are available as Excel(.xls or .xlw) spreadsheets. In most cases, these spreadsheet files offer the user direct access to more data than are shown either in the publication or Adobe Acrobat. These files usually contain more years of data, more geographic areas, and/or more categories of subjects than those shown in the Acrobat version. The extensive selection of statistics is provided for the United States, with selected data for regions, divisions, states, metropolitan areas, cities, and foreign countries from reports and records of government and private agencies. Software on the disc can be used to perform full-text searches, view official statistics, open tables as Lotus worksheets or Excel workbooks, and link directly to source agencies and organizations for supporting information. Except as indicated, figures are for the United States as presently constituted. Although emphasis in the Statistical Abstract is primarily given to national data, many tables present data for regions and individual states and a smaller number for metropolitan areas and cities.Statistics for the Commonwealth of Puerto Rico and for island areas of the United States are included in many state tables and are supplemented by information in Section 29. Additional information for states, cities, counties, metropolitan areas, and other small units, as well as more historical data are available in various supplements to the Abstract. Statistics in this edition are generally for the most recent year or period available by summer 2006. Each year over 1,400 tables and charts are reviewed and evaluated; new tables and charts of current interest are added, continuing series are updated, and less timely data are condensed or eliminated. Text notes and appendices are revised as appropriate. This year we have introduced 72 new tables covering a wide range of subject areas. These cover a variety of topics including: learning disability for children, people impacted by the hurricanes in the Gulf Coast area, employees with alternative work arrangements, adult computer and Internet users by selected characteristics, North America cruise industry, women- and minority-owned businesses, and the percentage of the adult population considered to be obese. 
Some of the annually surveyed topics are population; vital statistics; health and nutrition; education; law enforcement, courts and prison; geography and environment; elections; state and local government; federal government finances and employment; national defense and veterans affairs; social insurance and human services; labor force, employment, and earnings; income, expenditures, and wealth; prices; business enterprise; science and technology; agriculture; natural resources; energy; construction and housing; manufactures; domestic trade and services; transportation; information and communication; banking, finance, and insurance; arts, entertainment, and recreation; accommodation, food services, and other services; foreign commerce and aid; outlying areas; and comparative international statistics." Note to Users: This CD is part of a collection located in the Data Archive of the Odum Institute for Research in Social Science, at the University of North Carolina at Chapel Hill. The collection is located in Room 10, Manning Hall. Users may check out the CDs under the honor system. Items can be checked out for a period of two weeks. Loan forms are located adjacent to the collection.
🚀 Bulk Data Provider – Your Trusted Source for Verified B2B & B2C Databases in India
Looking for a reliable bulk data provider? Get verified B2B and B2C databases for marketing, telecalling, and lead generation from India’s leading source—Bulk Data…
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
HelpSteer is an open-source dataset designed to empower AI alignment through the support of fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes ranging between 0 and 4, with higher values indicating better quality. Using cutting-edge methods in machine learning and natural language processing in combination with annotation by data experts, HelpSteer strives to create a set of standardized values that can be used to measure alignment between human and machine interactions. With comprehensive data providing responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to help organizations foster reliable AI models that produce more accurate results, leading to improved user experience at all levels.
How to Use HelpSteer: An Open-Source AI Alignment Dataset
HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.
Step 1 - Choosing the Data File
HelpSteer contains two data files - one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above or getting them from the Google Drive repository attached here: [link]. The samples in each file consist of 7 columns describing a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; all take values between 0 and 4, where higher means better in the respective category.
Step 2 - Exploratory Data Analysis (EDA)
Once you have your file loaded into your workspace or favorite software environment (e.g. libraries like Pandas/NumPy, or even Microsoft Excel), it is time to explore it further by running some basic EDA commands that summarize each feature's distribution within the data set and note potential trends or points of interest - e.g. which traits polarize responses the most? Are there any outliers that might signal something interesting? Plotting these results often provides great insight into patterns across the dataset, which can be used later during the modeling phase, also known as feature engineering (a short loading and summary sketch follows after Step 3 below).
Step 3 - Data Preprocessing
Your interpretation of the raw data during EDA should produce some hypotheses about which features matter most when trying to estimate the attribute scores of unknown responses accurately. Preprocessing, such as cleaning up missing entries or handling outliers, is therefore highly recommended before starting any modelling effort with this data set. Refer back to the Kaggle page description if you are unsure about the allowed value ranges of specific attributes; having the correct numerical ranges in mind makes the modelling workload lighter later on when building predictive models. Do not rush this stage, otherwise poor results may occur when aiming for high accuracy at model deployment.
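A minimal sketch of Steps 1-2, assuming train.csv has been downloaded to the working directory and contains the seven columns listed above:

```python
# Minimal sketch: load train.csv and summarize the five annotated attributes.
import pandas as pd

train = pd.read_csv("train.csv")

attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
print(train[attributes].describe())            # distribution of each 0-4 attribute
print(train[attributes].corr())                # which attributes move together
print(train["response"].str.len().describe())  # rough verbosity proxy: response length
```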
- Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
- Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
- Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...
This dataset contains all current and active business licenses issued by the Department of Business Affairs and Consumer Protection. This dataset contains a large number of records/rows of data and may not be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or Wordpad, to view and search.
Data fields requiring description are detailed below.
APPLICATION TYPE: 'ISSUE' is the record associated with the initial license application. 'RENEW' is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. 'C_LOC' is a change of location record. It means the business moved. 'C_CAPA' is a change of capacity record. Only a few license types may file this type of application. 'C_EXPA' only applies to businesses that have liquor licenses. It means the business location expanded.
LICENSE STATUS: 'AAI' means the license was issued.
Business license owners may be accessed at: http://data.cityofchicago.org/Community-Economic-Development/Business-Owners/ezma-pppn To identify the owner of a business, you will need the account number or legal name.
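Because the full export is too large to open comfortably in Excel, a chunked read is one option. The sketch below assumes the export is saved as Business_Licenses.csv (a hypothetical name) and that the column headers match the field names described above.

```python
# Minimal sketch: stream the large CSV export in chunks instead of opening it in Excel,
# keeping only initial license issuances ('ISSUE') with status 'AAI' as described above.
# The file name and exact column headers are assumptions.
import pandas as pd

chunks = pd.read_csv("Business_Licenses.csv", chunksize=100_000, dtype=str)
issued = pd.concat(
    chunk[(chunk["APPLICATION TYPE"] == "ISSUE") & (chunk["LICENSE STATUS"] == "AAI")]
    for chunk in chunks
)
print(len(issued), "initial licenses issued")
```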
Data Owner: Business Affairs and Consumer Protection
Time Period: Current
Frequency: Data is updated daily
By data.world's Admin [source]
This dataset contains data used to analyze the uniquely popular business types in the neighborhoods of Seattle and New York City. We used publicly available neighborhood-level shapefiles to identify neighborhoods, and then crossed that information against Yelp's Business Category API to find businesses operating within each neighborhood. The ratio of businesses from each category was studied in comparison to their ratios in the entire city to determine any significant differences between each borough.
Any single business with more than one category was repeated for each category; however, no business was recorded twice for any single category. Moreover, if a certain business type didn't make up at least 1% of a particular neighborhood's businesses overall, it was removed from the analysis altogether.
The data available here is free to use under MIT license, with appropriate attribution given back to Yelp for providing this information. It is an invaluable resource for researchers across different disciplines looking into consumer behavior or clustering within urban areas!
How to Use This Dataset
To get started using this dataset:
- Download the appropriate file for the area you're researching - either top5_Seattle.csv or top5_NewYorkCity.csv - from the Kaggle site which hosts this dataset (https://www.kaggle.com/puddingmagazine/uniquely-popular-businesses).
- Read through each column's information available under the Columns section associated with this Kaggle description (above).
- Take note of columns that are relevant to your analysis, such as nCount, which indicates the number of businesses of a given type in a neighborhood; rank, which shows how popular that business type is overall; and neighborhoodTotal, which specifies the total number of businesses in a particular neighborhood.
- Load your selected file into an application designed for data analysis such as a Jupyter Notebook, Microsoft Excel or Power BI.
- Begin performing analyses, for example subsetting rows by specific neighborhoods to understand where certain unique business types are most common, or running regression-based analyses of how a business type's rank varies across neighborhoods (a short sketch follows below). If you have any questions about interpreting data from this source, please reach out.
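A minimal sketch of such an analysis, assuming top5_Seattle.csv and the columns described in the file listing below; it flags business types whose neighborhood share exceeds their citywide share:

```python
# Minimal sketch: find business types over-represented in a neighborhood relative to
# the city, using the documented columns (nCount, neighborhoodTotal, cCount, cityTotal).
import pandas as pd

df = pd.read_csv("top5_Seattle.csv")

df["neighborhood_share"] = df["nCount"] / df["neighborhoodTotal"]
df["city_share"] = df["cCount"] / df["cityTotal"]
df["lift"] = df["neighborhood_share"] / df["city_share"]  # >1 means uniquely popular locally

top = df.sort_values("lift", ascending=False)[["neighborhood", "yelpTitle", "lift"]]
print(top.head(10))
```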
- Analyzing the unique business trends in Seattle and New York City to identify potential investment opportunities.
- Creating a tool that helps businesses understand what local competitions they face by neighborhood.
- Exploring the distinctions between neighborhoods by plotting out the different businesses they have in comparison with each other and other cities
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: top5_Seattle.csv

| Column name | Description |
|:------------|:------------|
| neighborhood | Name of the neighborhood. (String) |
| yelpAlias | The Yelp-specified alias for the business type. (String) |
| yelpTitle | The title given to this business type by Yelp. (String) |
| nCount | Number of businesses with this type within a particular neighborhood. (Integer) |
| neighborhoodTotal | Total number of businesses located within that particular region. (Integer) |
| cCount | Number of businesses with this storefront within an entire city. (Integer) |
| cityTotal | Total number of all types of storefronts within an entire city. (Integer) ... |
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is perfect for natural language processing tasks such as sentiment analysis
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This Yelp reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.
To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel). Next, take a look at the first few rows of data to get a feel for what you're working with:
| text | label |
|:-----|:------|
| So there is no way for me to plug it in here in the US unless I go by... | 0 |
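For those working in Python instead of a spreadsheet, a minimal loading sketch is shown below; it assumes train.csv and test.csv sit in the working directory as described above.

```python
# Minimal sketch: load the review files and peek at the data described above.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.head())                   # first few rows: text + label (0 = negative, 1 = positive)
print(train["label"].value_counts())  # class balance of the training set
```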
- This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
- This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
- This dataset could be used to build a natural language processing system that generates fake Yelp reviews
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |

File: test.csv

| Column name | Description |
|:------------|:-----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
Project: Data Analysis using Excel Pivot Tables & Charts
Based on the analysis of 6,607 students, this project identifies that active student habits (Attendance, Tutoring) are stronger predictors of success than environmental factors (Income, Resources).
Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as RStudio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose, as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeat's marketing strategies for future growth. My task is to provide data-driven insights into business tasks provided by the Bellabeats, Inc. executive and data analysis team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task using an asterisk (*) as an identifier.
Section 1 - Ask:
A. Guiding Questions:
1. Who are the key stakeholders and what are their goals for the data analysis project?
2. What is the business task that this data analysis project is attempting to solve?
B. Key Tasks:
1. Identify key stakeholders and their goals for the data analysis project
*The key stakeholders for this project are as follows:
-Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc.
-Bellabeats marketing analytics team. I am a member of this team.
Section 2 - Prepare:
A. Guiding Questions:
1. Where is the data stored and organized?
2. Are there any problems with the data?
3. How does the data help answer the business question?
B. Key Tasks:
Research and communicate the source of the data, and how it is stored/organized to stakeholders.
*The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored in Kaggle and was made available through user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
*Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents onto my laptop and decided to use 2 documents for the purposes of this project, as they were files which had merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were (a short merge sketch follows the list):
-sleepDay_merged.csv
-dailyActivity_merged.csv
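A minimal sketch of combining these two files, assuming the column names commonly seen in the FitBit Kaggle exports (Id, ActivityDate, SleepDay); these names are assumptions and should be checked against the actual files.

```python
# Minimal sketch: combine the two merged FitBit exports on user Id and calendar date
# so activity and sleep can be analyzed together. Column names are assumptions.
import pandas as pd

activity = pd.read_csv("dailyActivity_merged.csv", parse_dates=["ActivityDate"])
sleep = pd.read_csv("sleepDay_merged.csv", parse_dates=["SleepDay"])

sleep["ActivityDate"] = sleep["SleepDay"].dt.normalize()  # align timestamps to calendar dates
combined = activity.merge(sleep, on=["Id", "ActivityDate"], how="inner")

print(combined["Id"].nunique(), "distinct users in the combined file")
```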
Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
*As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported.
*As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...
This data set includes soil temperature data from boreholes located at five stations in Russia: Yakutsk, Verkhoyansk, Pokrovsk, Isit', and Churapcha. The data have been compiled into five Microsoft Excel files, one for each station. Each Excel file contains three worksheets:
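A minimal sketch of reading one station's workbook is shown below; the file name and sheet names are hypothetical placeholders, since only the station names are given above.

```python
# Minimal sketch: read one station's workbook and list its worksheets.
# "Yakutsk.xlsx" is a hypothetical file name; the dataset ships one Excel file per station.
import pandas as pd

workbook = pd.read_excel("Yakutsk.xlsx", sheet_name=None)  # dict: sheet name -> DataFrame
for name, sheet in workbook.items():
    print(name, sheet.shape)
```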
The intention is to collect data for the calendar year 2009 (or the nearest year for which each business keeps its accounts). The survey is considered a one-off survey, although for accurate national accounts (NA), such a survey should be conducted at least every five years to enable regular updating of the ratios, etc., needed to adjust the ongoing indicator data (mainly VAGST) to NA concepts. The questionnaire will be drafted by FSD, largely following the previous BAS, updated to current accounting terminology where necessary. The questionnaire will be pilot tested, using some accountants who are likely to complete a number of the forms on behalf of their business clients, and a small sample of businesses. Consultations will also include the Ministry of Finance, Ministry of Commerce, Industry and Labour, Central Bank of Samoa (CBS), Samoa Tourism Authority, Chamber of Commerce, and other business associations (hotels, retail, etc.).
The questionnaire will collect a number of items of information about the business ownership, locations at which it operates and each establishment for which detailed data can be provided (in the case of complex businesses), contact information, and other general information needed to clearly identify each unique business. The main body of the questionnaire will collect data on income and expenses, to enable value added to be derived accurately. The questionnaire will also collect data on capital formation, and will contain supplementary pages for relevant industries to collect volume of production data for selected commodities and to collect information to enable an estimate of value added generated by key tourism activities.
The principal user of the data will be FSD which will incorporate the survey data into benchmarks for the NA, mainly on the current published production measure of GDP. The information on capital formation and other relevant data will also be incorporated into the experimental estimates of expenditure on GDP. The supplementary data on volumes of production will be used by FSD to redevelop the industrial production index which has recently been transferred under the SBS from the CBS. The general information about the business ownership, etc., will be used to update the Business Register.
Outputs will be produced in a number of formats, including a printed report containing descriptive information of the survey design, data tables, and analysis of the results. The report will also be made available on the SBS website in “.pdf” format, and the tables will be available on the SBS website in excel tables. Data by region may also be produced, although at a higher level of aggregation than the national data. All data will be fully confidentialised, to protect the anonymity of all respondents. Consideration may also be made to provide, for selected analytical users, confidentialised unit record files (CURFs).
A high level of accuracy is needed because the principal purpose of the survey is to develop revised benchmarks for the NA. The initial plan was that the survey will be conducted as a stratified sample survey, with full enumeration of large establishments and a sample of the remainder.
National Coverage
The main statistical unit to be used for the survey is the establishment. For simple businesses that undertake a single activity at a single location there is a one-to-one relationship between the establishment and the enterprise. For large and complex enterprises, however, it is desirable to separate each activity of an enterprise into establishments to provide the most detailed information possible for industrial analysis. The business register will need to be developed in such a way that records the links between establishments and their parent enterprises. The business register will be created from administrative records and may not have enough information to recognize all establishments of complex enterprises. Large businesses will be contacted prior to the survey post-out to determine if they have separate establishments. If so, the extended structure of the enterprise will be recorded on the business register and a questionnaire will be sent to the enterprise to be completed for each establishment.
SBS has decided to follow the New Zealand simplified version of its statistical units model for the 2009 BAS. Future surveys may consider location units and enterprise groups if they are found to be useful for statistical collections.
It should be noted that while establishment data may enable the derivation of detailed benchmark accounts, it may be necessary to aggregate up to enterprise level data for the benchmarks if the ongoing data used to extrapolate the benchmark forward (mainly VAGST) are only available at the enterprise level.
The BAS covered all employing units, and excluded small non-employing units such as the market sellers. The surveys also excluded central government agencies engaged in public administration (ministries, public education and health, etc.). It only covers businesses that pay VAGST (threshold of SAT$75,000 and upwards).
Sample survey data [ssd]
-Total sample size was 1,240.
-Out of the 1,240, 902 successfully completed the questionnaire.
-The remaining 338 either never responded or were omitted (some businesses were omitted from the sample as they did not meet the requirement to be surveyed).
-Selection was all employing units paying VAGST (threshold SAT $75,000 upwards).
WILL CONFIRM LATER!!
OSO LE MEA E LE FAASA...AEA :-)
Mail Questionnaire [mail]
Supplementary Pages
Additional pages have been prepared to collect data for a limited range of industries.
1. Production data. To rebase and redevelop the Industrial Production Index (IPI), it is intended to collect volume of production information from a selection of large manufacturing businesses. The selection of businesses and products is critical to the usefulness of the IPI. The products must be homogeneous, and be of enough importance to the economy to justify collecting the data. Significance criteria should be established for the selection of products to include in the IPI, and the 2009 BAS provides an opportunity to collect benchmark data for a range of products known to be significant (based on information in the existing IPI, CPI weights, export data, etc.), as well as open questions for respondents to provide information on other significant products.
2. Tourism. There is a strong demand for estimates of tourism value added. To estimate tourism value added using the international standard Tourism Satellite Account methodology requires the use of an input-output table, which is beyond the capacity of SBS at present. However, some indicative estimates of the main parts of the economy influenced by tourism can be derived if the necessary data are collected. Tourism is a demand concept, based on defining tourists (the international standard includes both international and domestic tourists), what products are characteristically purchased by tourists, and which industries supply those products. Some questions, targeted at those industries that have significant involvement with tourists (hotels, restaurants, transport and tour operators, vehicle hire, etc.), on how much of their income is sourced from tourism, would provide valuable indicators of the size of the direct impact of tourism.
Partial imputation was done at the time of receipt of questionnaires, after follow-up procedures to obtain fully completed questionnaires had been followed. Imputation followed a ratio approach, i.e., applying ratios from responding units in the imputation cell to the partial data that was supplied (a small sketch of this approach follows below). Procedures were established during the editing stage (a) to preserve the integrity of the questionnaires as supplied by respondents, and (b) to record all changes made to the questionnaires during editing. If SBS staff write on a form, for example, this should only be done in red pen, to distinguish the alterations from the original information.
Additional edit checks were developed, including checking against external data at enterprise/establishment level. External data to be checked against include VAGST and SNPF data, for turnover and purchases, and for salaries, wages and employment, respectively. Editing and imputation processes were undertaken by FSD using Excel.
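A minimal sketch of the ratio imputation described above, in Python rather than Excel; the file name and column names (imputation_cell, turnover, purchases) are hypothetical placeholders.

```python
# Minimal sketch: within each imputation cell, impute a missing item by scaling a
# reported item with the ratio observed among fully responding units.
# File and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("bas2009_responses.csv")

# Ratio of purchases to turnover among units reporting both, computed per imputation cell.
complete = df.dropna(subset=["turnover", "purchases"])
cell_sums = complete.groupby("imputation_cell")[["purchases", "turnover"]].sum()
ratios = cell_sums["purchases"] / cell_sums["turnover"]

# Impute missing purchases from reported turnover using the cell ratio.
missing = df["purchases"].isna() & df["turnover"].notna()
df.loc[missing, "purchases"] = df.loc[missing, "imputation_cell"].map(ratios) * df.loc[missing, "turnover"]
print(df["purchases"].isna().sum(), "purchases values still missing")
```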
NOT APPLICABLE!!
License: Norwegian Licence for Open Government Data (NLOD) 2.0 - http://spdx.org/licenses/NLOD-2.0
The data sets provide an overview of selected data on waterworks registered with the Norwegian Food Safety Authority. The information has been reported by the waterworks through application processing or other reporting to the Norwegian Food Safety Authority. The drinking water regulations require, among other things, annual reporting, and the Norwegian Food Safety Authority has created a separate form service for this purpose. The data sets include public or private waterworks that supply 50 people or more. In addition, all municipally owned businesses with their own water supply are included regardless of size. The data sets also contain decommissioned facilities, for those who wish to view historical data, i.e. data for previous years.
There are data sets for the following supervisory objects: 1. Water supply system (also includes analysis of drinking water). 2. Transport system. 3. Treatment facility. 4. Entry point (also includes analysis of the water source). Below you will find the data set: 4. Input point_reporting (the entry point reporting file). In addition, there is a file (information.txt) that provides an overview of when the extracts were produced and how many lines there are in the individual files. The extracts are produced weekly.
Furthermore, for the water supply system, transport system and entry point data sets it is possible to see historical data on what is included in the annual reporting. To make use of that information, the file must be linked to the parent ("moder") file to get names and other static information. These files have the _reporting ending in the file name. Descriptions of the data fields (i.e. metadata) in the individual data sets appear in separate files, available in PDF format.
If you double-click the CSV file and it opens directly in Excel, the Norwegian characters æ, ø and å will not display correctly. To see the character set correctly in Excel: start Excel with a new spreadsheet; select Data and then From Text, and press Import; select Delimited and file origin 65001: Unicode (UTF-8), tick "My data has headers" and press Next; remove Tab as the separator, select Semicolon as the separator and press Next; then complete the import. Alternatively, the complete data sets can be imported into a separate database and compiled as desired. There are link keys in the files that make it possible to link the files together. The waterworks are responsible for the quality of the data sets.
—
Purpose: Make data for drinking water supply available to the public.
License: Attribution 4.0 International (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The evaluation of surveillance systems has been recommended by the World Health Organization (WHO) to identify their performance and areas for improvement. Universal salt iodization (USI), as one of the surveillance systems in Tanzania, needs periodic evaluation for its optimal function. This study aimed at evaluating the USI surveillance system in Tanzania from January to December 2021 to find out if the system meets its intended objectives by evaluating its attributes, as this was the first evaluation of the USI surveillance system since its establishment in 2010. The USI surveillance system is key for monitoring performance towards the attainment of universal salt iodization (90%).
Methodology
This evaluation was guided by the Centers for Disease Control and Prevention (CDC) Guidelines for Evaluating Public Health Surveillance Systems (MMWR) to evaluate USI 2021 data. The study was conducted in the Kigoma region in March 2022. Both purposive and convenience sampling were used to select the region, district, and ward for the study. The study involved reviewing documents used in the USI system and interviewing the key informants in the USI program. Data analysis was done in Microsoft Excel and presented in tables and graphs.
Results
A total of 1,715 salt samples were collected in the year 2021, with 279 (16%) identified as non-iodized salt. The majority of the system attributes (66.7%) had good performance with a score of three, 22.2% had moderate performance with a score of two, and one attribute had poor performance with a score of one. Data quality, completeness and sensitivity were 100%, acceptability was 91.6%, simplicity was 83% (data on a single sample could be collected in under 2 minutes), system stability in terms of performance was above 75%, and the usefulness of the system had poor performance.
Conclusion
Although the system attributes were overall found to be working well, for proper surveillance of the USI system the core attributes need to be strengthened. Key variables that measure system performance must be included from the primary data source and well integrated from Local Government (district and regional) information systems to the Ministry of Health information systems.
Excel files containing source data for habitat selection
Tempe's trust data for this measure is collected every month and comes from the "Safety" result of the monthly administered Police Sentiment Survey. One question feeds into these results: "When it comes to the threat of crime, how safe do you feel in your neighborhood?" Benchmark data is from cohorts of communities with similar characteristics, such as size, population density, and region. This data is collected every month and quarter via a recurring report.
This page provides data for the Feeling of Safety in Your Neighborhood performance measure. The performance measure dashboard is available at 1.05 Feeling of Safety in Your Neighborhood.
Data Dictionary
Additional Information
Source: Zencity
Contact: Amber Asburry
Contact E-Mail: strategic_management_innovation@tempe.gov
Data Source Type: Excel, CSV
Preparation Method: Take the "Safety" score from the Police Sentiment Survey. This score is the average of the top two results for the question underneath this area on the report. The monthly scores are then averaged to get the quarterly score (a short sketch of this averaging follows below).
Publish Frequency: Monthly
Publish Method: Manual
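A minimal sketch of averaging the monthly "Safety" scores into quarterly values, as described in the preparation method; the file and column names are hypothetical placeholders.

```python
# Minimal sketch: average monthly "Safety" scores into quarterly values.
# "police_sentiment_safety.csv", "month" and "safety_score" are hypothetical names.
import pandas as pd

scores = pd.read_csv("police_sentiment_safety.csv", parse_dates=["month"])
quarterly = scores.groupby(scores["month"].dt.to_period("Q"))["safety_score"].mean()
print(quarterly)
```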