https://creativecommons.org/publicdomain/zero/1.0/
The dataset provides information on the daily top 200 tracks listened to by users of the Spotify digital platform around the world.
I put together this dataset because I really love music (I listen to it for several hours a day) and could not find a similar dataset with track genres on Kaggle.
The dataset can be useful for beginners working with data. It contains missing values, array-valued columns, and so on, which make it great practice for the EDA phase (see the sketch below).
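A minimal EDA sketch of the two quirks mentioned above (missing values and array-like columns), assuming pandas is available; the file name and the artist_genres column are hypothetical and may not match the dataset's actual layout:

```python
import ast
import pandas as pd

df = pd.read_csv("spotify_top200_daily.csv")  # hypothetical file name

# Count missing values per column as a first EDA step.
print(df.isna().sum().sort_values(ascending=False))

# Array-valued columns are often stored as strings like "['pop', 'dance pop']";
# parse them back into Python lists, leaving missing entries untouched.
df["artist_genres"] = df["artist_genres"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)
print(df["artist_genres"].explode().value_counts().head(10))
```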
Soon, my own example will appear here showing how, based on this dataset, you can go on a musical journey around the world and see how humanity's musical tastes have changed.
In addition, I will be very happy to see the work of the community on this dataset.
Also, if there is interest in per-country data, I am ready to publish it upon request.
You can contact me through: telegram @natarov_ivan
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data scraped from Overbuff using Python and Selenium. The development environment was Jupyter Notebook.
The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to every hero as well as hero-specific information).
Note: data for some columns are missing on Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. 'Self Healing' column for Bastion was dropped too as Bastion doesn't have this property anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data isn't changed). If you know a site where it's possible to get this data, please, leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in five stages:
Data are retrieved directly from HTML elements on the page using Selenium in Python.
After scraping, the data were cleansed: 1) thousands separators were removed (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') were converted to seconds (1*60 + 23 => 83); 3) Lúcio became Lucio and Torbjörn became Torbjorn. A sketch of this cleansing step follows the list below.
Data were arranged into a table and saved to CSV.
Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.
Additional missing values are searched for and dealt with: either a column is renamed (when the program cannot infer the correct column name for missing values) or the column is dropped. This stage ensures all wrong data are truly fixed.
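A minimal sketch of the stage-2 cleansing rules, assuming the scraped values arrive as plain strings; the function names are illustrative and not part of the project's actual code:

```python
import unicodedata

def clean_number(value: str) -> int:
    """Drop thousands separators, e.g. '1,009' -> 1009."""
    return int(value.replace(",", ""))

def time_to_seconds(value: str) -> int:
    """Convert 'mm:ss' to seconds, e.g. '01:23' -> 1*60 + 23 = 83."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

def normalize_name(name: str) -> str:
    """Strip diacritics, e.g. 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(clean_number("1,009"), time_to_seconds("01:23"), normalize_name("Torbjörn"))
```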
The procedure to fetch the data takes 7 minutes on average.
This project and its code grew out of an existing project on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Young People Survey’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/miroslavsabo/young-people-survey on 30 September 2021.
--- Dataset description provided by original source is as follows ---
In 2013, students of the Statistics class at FSEV UK (https://fses.uniba.sk/en/) were asked to invite their friends to participate in this survey.
The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical). See the columns.csv file if you want to match the data with the original variable names. The variables can be split into the following groups:
Many different techniques can be used to answer many questions, e.g.
(In Slovak) Sleziak, P. - Sabo, M.: Gender differences in the prevalence of specific phobias. Forum Statisticum Slovacum. 2014, Vol. 10, No. 6. [Differences (gender + whether people lived in village/town) in the prevalence of phobias.]
Sabo, Miroslav. Multivariate Statistical Methods with Applications. Diss. Slovak University of Technology in Bratislava, 2014. [Clustering of variables (music preferences, movie preferences, phobias) + Clustering of people w.r.t. their interests.]
a MOCK dataset used to show how to import Qualtrics metadata into the codebook R package
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
ResponseSet | NA | 0 |
Q7 | NA | 0 |
Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is updated more frequently and can be visualized on NCWQR's data portal.
If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.
The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.
Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.
At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.
Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.
2017 Ohio EPA Project Study Plan and Quality Assurance Plan
Data quality control and data screening
The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.
This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.
A note on detection limits and zero and negative concentrations
It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.
The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often "censored" and replaced with a statement indicating that the value was below the detection limit.
Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.
For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
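As a quick illustration of the convention described above, here is a minimal pandas sketch that treats the -9 and -1 codes as missing while keeping genuine zero and negative concentrations; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("river_data.csv")  # hypothetical file name

# Replace the -9 / -1 missing-value codes with NaN, but keep all other values,
# including small negative concentrations, which are valid analytical results.
df["TP_mg_L"] = df["TP_mg_L"].replace([-9, -1], float("nan"))

# Averages now ignore only the coded-missing values, not the low-end results.
print(df["TP_mg_L"].mean())
```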
Analyte Detection Limits
https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024
For more information, please visit https://ncwqr.org/
The dataset has N=354 rows and 9 columns. 354 rows have no missing values on any column.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
municipio_cod | NA | 0 |
municipio_fato | NA | 0 |
data_fato | NA | 0 |
mes | NA | 0 |
ano | NA | 0 |
risp | NA | 0 |
rmbh | NA | 0 |
tentado_consumado | NA | 0 |
qtde_vitimas | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.2).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The original data include 10 subjects, each performing 10 gestures 10 times. The gesture acquisition device is a Nintendo Wii Remote controller with a built-in three-axis accelerometer. Each subject performs a set of gestures multiple times. Classes are based on gestures (see class labels below). Note that data are shuffled and randomly sampled, so instances across datasets are not synchronized by dimension or subject. Time series are of different lengths. There are no missing values.
The gestures are listed below (original class label: English translation):
poteg: pick-up
shake: shake
desno: one move to the right
levo: one move to the left
gor: one move up
dol: one move down
kroglevo: one left circle
krogdesn: one right circle
suneknot: one move toward the screen
sunekven: one move away from the screen
This dataset contains the acceleration in the y-axis dimension.
Donor: J. Guna
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is provided in a single .xlsx file named "eucalyptus_growth_environment_data_V2.xlsx" and consists of fifteen sheets:
Codebook: This sheet details the index, values, and descriptions for each field within the dataset, providing a comprehensive guide to understanding the data structure.
ALL NODES: Contains measurements from all devices, totalling 102,916 data points. This sheet aggregates the data across all nodes.
GWD1 to GWD10: These subset sheets include measurements from individual nodes, labelled according to the abbreviation “Generic Wireless Dendrometer” followed by device IDs 1 through 10. Each sheet corresponds to a specific node, representing measurements from ten trees (or nodes).
Metadata: Provides detailed metadata for each node, including species, initial diameter, location, measurement frequency, battery specifications, and irrigation status. This information is essential for identifying and differentiating the nodes and their specific attributes.
Missing Data Intervals: Details gaps in the data stream, including start and end dates and times when data was not uploaded. It includes information on the total duration of each missing interval and the number of missing data points.
Missing Intervals Distribution: Offers a summary of missing data intervals and their distribution, providing insight into data gaps and reasons for missing data.
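A minimal sketch of loading the workbook in Python, assuming pandas with an xlsx engine such as openpyxl; the sheet names follow the description above and may differ slightly in the actual file:

```python
import pandas as pd

# Load every sheet into a dict of DataFrames keyed by sheet name.
sheets = pd.read_excel("eucalyptus_growth_environment_data_V2.xlsx", sheet_name=None)

codebook = sheets["Codebook"]    # field index, values, and descriptions
all_nodes = sheets["ALL NODES"]  # aggregated measurements from all devices
metadata = sheets["Metadata"]    # per-node species, location, irrigation status, etc.
gwd1 = sheets["GWD1"]            # measurements from a single dendrometer node

print(all_nodes.shape)
```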
All nodes utilize LoRaWAN for data transmission. Please note that intermittent data gaps may occur due to connectivity issues between the gateway and the nodes, as well as maintenance activities or experimental procedures.
Software considerations: The provided R code named “Simple_Dendro_Imputation_and_Analysis.R” is a comprehensive analysis workflow that processes and analyses Eucalyptus growth and environmental data from the "eucalyptus_growth_environment_data_V2.xlsx" dataset. The script begins by loading necessary libraries, setting the working directory, and reading the data from the specified Excel sheet. It then combines date and time information into a unified DateTime format and performs data type conversions for relevant columns. The analysis focuses on a specified device, allowing for the selection of neighbouring devices for imputation of missing data. A loop checks for gaps in the time series and fills in missing intervals based on a defined threshold, followed by a function that imputes missing values using the average from nearby devices. Outliers are identified and managed through linear interpolation. The code further calculates vapor pressure metrics and applies temperature corrections to the dendrometer data. Finally, it saves the cleaned and processed data into a new Excel file while conducting dendrometer analysis using the dendRoAnalyst package, which includes visualizations and calculations of daily growth metrics and correlations with environmental factors such as vapour pressure deficit (VPD).
a small mock Big Five Inventory dataset
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
session | NA | 0 |
created | user first opened survey | 0 |
modified | user last edited survey | 0 |
ended | user finished survey | 0 |
expired | NA | 28 |
BFIK_open_2 | Ich bin tiefsinnig, denke gerne über Sachen nach. | 0 |
BFIK_agree_4R | Ich kann mich schroff und abweisend anderen gegenüber verhalten. | 0 |
BFIK_extra_2 | Ich bin begeisterungsfähig und kann andere leicht mitreißen. | 0 |
BFIK_agree_1R | Ich neige dazu, andere zu kritisieren. | 0 |
BFIK_open_1 | Ich bin vielseitig interessiert. | 0 |
BFIK_neuro_2R | Ich bin entspannt, lasse mich durch Stress nicht aus der Ruhe bringen. | 0 |
BFIK_consc_3 | Ich bin tüchtig und arbeite flott. | 0 |
BFIK_consc_4 | Ich mache Pläne und führe sie auch durch. | 0 |
BFIK_consc_2R | Ich bin bequem, neige zur Faulheit. | 0 |
BFIK_agree_3R | Ich kann mich kalt und distanziert verhalten. | 0 |
BFIK_extra_3R | Ich bin eher der "stille Typ", wortkarg. | 0 |
BFIK_neuro_3 | Ich mache mir viele Sorgen. | 0 |
BFIK_neuro_4 | Ich werde leicht nervös und unsicher. | 0 |
BFIK_agree_2 | Ich schenke anderen leicht Vertrauen, glaube an das Gute im Menschen. | 0 |
BFIK_consc_1 | Ich erledige Aufgaben gründlich. | 0 |
BFIK_open_4 | Ich schätze künstlerische und ästhetische Eindrücke. | 0 |
BFIK_extra_4 | Ich gehe aus mir heraus, bin gesellig. | 0 |
BFIK_extra_1R | Ich bin eher zurückhaltend, reserviert. | 0 |
BFIK_open_3 | Ich habe eine aktive Vorstellungskraft, bin phantasievoll. | 0 |
BFIK_agree | 4 BFIK_agree items aggregated by aggregation_function | 0 |
BFIK_open | 4 BFIK_open items aggregated by aggregation_function | 0 |
BFIK_consc | 4 BFIK_consc items aggregated by aggregation_function | 0 |
BFIK_extra | 4 BFIK_extra items aggregated by aggregation_function | 0 |
BFIK_neuro | 3 BFIK_neuro items aggregated by aggregation_function | 0 |
age | Alter | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
KuaiRec is a real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou. To date, it is the first dataset that contains a fully observed user-item interaction matrix. By “fully observed”, we mean there are almost no missing values in the user-item matrix, i.e., each user has viewed each video and then left feedback.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘WHO national life expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mmattson/who-national-life-expectancy on 28 January 2022.
--- Dataset description provided by original source is as follows ---
I am developing my data science skills in areas outside of my previous work. An interesting problem for me was to identify which factors influence life expectancy on a national level. There is an existing Kaggle data set that explored this, but that information was corrupted. Part of the problem solving process is to step back periodically and ask "does this make sense?" Without reasonable data, it is harder to notice mistakes in my analysis code (as opposed to unusual behavior due to the data itself). I wanted to make a similar data set, but with reliable information.
This is my first time exploring life expectancy, so I had to guess which features might be of interest when making the data set. Some were included for comparison with the other Kaggle data set. A number of potentially interesting features (like air pollution) were left off due to limited year or country coverage. Since the data was collected from more than one server, some features are present more than once, to explore the differences.
A goal of the World Health Organization (WHO) is to ensure that a billion more people are protected from health emergencies, and provided better health and well-being. They provide public data collected from many sources to identify and monitor factors that are important to reach this goal. This set was primarily made using GHO (Global Health Observatory) and UNESCO (United Nations Educational Scientific and Culture Organization) information. The set covers the years 2000-2016 for 183 countries, in a single CSV file. Missing data is left in place, for the user to decide how to deal with it.
Three notebooks are provided for my cursory analysis, a comparison with the other Kaggle set, and a template for creating this data set.
There is a lot to explore, if the user is interested. The GHO server alone has over 2000 "indicators". - How are the GHO and UNESCO life expectancies calculated, and what is causing the difference? That could also be asked for Gross National Income (GNI) and mortality features. - How does the life expectancy after age 60 compare to the life expectancy at birth? Is the relationship with the features in this data set different for those two targets? - What other indicators on the servers might be interesting to use? Some of the GHO indicators are different studies with different coverage. Can they be combined to make a more useful and robust data feature? - Unraveling the correlations between the features would take significant work.
--- Original source retains full ownership of the source dataset ---
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
WARNING: This is a pre-release dataset and its field names and data structures are subject to change. It should be considered pre-release until the end of 2024. Expected changes:
Metadata is missing or incomplete for some layers at this time and will be continuously improved.
We expect to update this layer roughly in line with CDTFA at some point, but will increase the update cadence over time as we are able to automate the final pieces of the process.
This dataset is continuously updated as the source data from CDTFA is updated, as often as many times a month. If you require unchanging point-in-time data, export a copy for your own use rather than using the service directly in your applications.
Purpose
County and incorporated place (city) boundaries along with third-party identifiers used to join in external data. Boundaries are from the authoritative source, the California Department of Tax and Fee Administration (CDTFA), altered to show the counties as one polygon. This layer displays the city polygons on top of the county polygons so the area isn't interrupted. The GEOID attribute information is added from the US Census. GEOID is based on merged State and County FIPS codes for the counties. Abbreviations for counties and cities were added from Caltrans Division of Local Assistance (DLA) data. Place Type was populated with information extracted from the Census. Names and IDs from the US Board on Geographic Names (BGN), the authoritative source of place names as published in the Geographic Name Information System (GNIS), are attached as well. Finally, coastal buffers are removed, leaving the land-based portions of jurisdictions. This feature layer is for public use.
Related Layers
This dataset is part of a grouping of many datasets:
Cities: Only the city boundaries and attributes, without any unincorporated areas (With Coastal Buffers / Without Coastal Buffers)
Counties: Full county boundaries and attributes, including all cities within as a single polygon (With Coastal Buffers / Without Coastal Buffers)
Cities and Full Counties: A merge of the other two layers, so polygons overlap within city boundaries. Some customers require this behavior, so we provide it as a separate service. (With Coastal Buffers / Without Coastal Buffers - this dataset)
Place Abbreviations
Unincorporated Areas (Coming Soon)
Census Designated Places (Coming Soon)
Cartographic Coastline (Polygon; Line source Coming Soon)
Working with Coastal Buffers
The dataset you are currently viewing includes the coastal buffers for cities and counties that have them in the authoritative source data from CDTFA. In the versions where they are included, they remain as a second polygon on cities or counties that have them, with all the same identifiers, and a value in the COASTAL field indicating if it's an ocean or a bay buffer. If you wish to have a single polygon per jurisdiction that includes the coastal buffers, you can run a Dissolve on the version that has the coastal buffers, on all the fields except COASTAL, Area_SqMi, Shape_Area, and Shape_Length, to get a version with the correct identifiers.
Point of Contact
California Department of Technology, Office of Digital Services, odsdataservices@state.ca.gov
Field and Abbreviation Definitions
COPRI: county number followed by the 3-digit city primary number used in the Board of Equalization's 6-digit tax rate area numbering system
Place Name: CDTFA incorporated (city) or county name
County: CDTFA county name. For counties, this will be the name of the polygon itself. For cities, it is the name of the county the city polygon is within.
Legal Place Name: Board on Geographic Names authorized nomenclature for area names published in the Geographic Name Information System
GNIS_ID: The numeric identifier from the Board on Geographic Names that can be used to join these boundaries to other datasets utilizing this identifier.
GEOID: numeric geographic identifiers from the US Census Bureau
Place Type: Board on Geographic Names authorized nomenclature for boundary type published in the Geographic Name Information System
Place Abbr: CalTrans Division of Local Assistance abbreviations of incorporated area names
CNTY Abbr: CalTrans Division of Local Assistance abbreviations of county names
Area_SqMi: The area of the administrative unit (city or county) in square miles, calculated in EPSG 3310 California Teale Albers.
COASTAL: Indicates if the polygon is a coastal buffer. Null for land polygons. Additional values include "ocean" and "bay".
GlobalID: While all of the layers we provide in this dataset include a GlobalID field with unique values, we do not recommend you make any use of it. The GlobalID field exists to support offline sync, but is not persistent, so data keyed to it will be orphaned at our next update. Use one of the other persistent identifiers, such as GNIS_ID or GEOID, instead.
Accuracy
CDTFA's source data notes the following about accuracy: City boundary changes and county boundary line adjustments filed with the Board of Equalization per Government Code 54900. This GIS layer contains the boundaries of the unincorporated county and incorporated cities within the state of California. The initial dataset was created in March of 2015 and was based on the State Board of Equalization tax rate area boundaries. As of April 1, 2024, the maintenance of this dataset is provided by the California Department of Tax and Fee Administration for the purpose of determining sales and use tax rates. The boundaries are continuously being revised to align with aerial imagery when areas of conflict are discovered between the original boundary provided by the California State Board of Equalization and the boundary made publicly available by local, state, and federal government. Some differences may occur between actual recorded boundaries and the boundaries used for sales and use tax purposes. The boundaries in this map are representations of taxing jurisdictions for the purpose of determining sales and use tax rates and should not be used to determine precise city or county boundary line locations. COUNTY = county name; CITY = city name or unincorporated territory; COPRI = county number followed by the 3-digit city primary number used in the California State Board of Equalization's 6-digit tax rate area numbering system (for the purpose of this map, unincorporated areas are assigned 000 to indicate that the area is not within a city).
Boundary Processing
These data make a structural change from the source data. While the full boundaries provided by CDTFA include coastal buffers of varying sizes, many users need boundaries to end at the shoreline of the ocean or a bay. As a result, after examining existing city and county boundary layers, these datasets provide a coastline cut generally along the ocean-facing coastline. For county boundaries in northern California, the cut runs near the Golden Gate Bridge, while for cities, we cut along the bay shoreline and into the edge of the Delta at the boundaries of Solano, Contra Costa, and Sacramento counties. In the services linked above, the versions that include the coastal buffers contain them as a second (or third) polygon for the city or county, with the value in the COASTAL field set to whether it's a bay or ocean polygon. These can be processed back into a single polygon by dissolving on all the fields you wish to keep, since the attributes, other than the COASTAL field and geometry attributes (like areas), remain the same between the polygons for this purpose.
Slivers
In cases where a city or county's boundary ends near a coastline, our coastline data may cross back and forth many times while roughly paralleling the jurisdiction's boundary, resulting in many polygon slivers. We post-process the data to remove these slivers using a city/county boundary priority algorithm. That is, when the data run parallel to each other, we discard the coastline cut and keep the CDTFA-provided boundary, even if it extends into the ocean a small amount. This processing supports consistent boundaries for Fort Bragg, Point Arena, San Francisco, Pacifica, Half Moon Bay, and Capitola, in addition to others. More information on this algorithm will be provided soon.
Coastline Caveats
Some cities have buffers extending into water bodies that we do not cut at the shoreline. These include South Lake Tahoe and Folsom, which extend into neighboring lakes, and San Diego and surrounding cities that extend into San Diego Bay, which our shoreline encloses. If you have feedback on the exclusion of these items, or others, from the shoreline cuts, please reach out using the contact information above.
Offline Use
This service is fully enabled for sync and export using Esri Field Maps or other similar tools. Importantly, the GlobalID field exists only to support that use case and should not be used for any other purpose (see note in field descriptions).
Updates and Date of Processing
Concurrent with CDTFA updates, approximately every two weeks. Last Processed: 12/17/2024 by Nick Santos using code path at https://github.com/CDT-ODS-DevSecOps/cdt-ods-gis-city-county/ at commit 0bf269d24464c14c9cf4f7dea876aa562984db63. It incorporates updates from CDTFA as of 12/12/2024. Future updates will include improvements to metadata and update frequency.
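A minimal sketch of the dissolve step described above, assuming geopandas is available and the "with coastal buffers" version of the layer has been exported locally (the file name here is hypothetical):

```python
import geopandas as gpd

# Hypothetical local export of the version that includes coastal buffers.
gdf = gpd.read_file("cities_counties_with_coastal_buffers.geojson")

# Merge land and buffer polygons into one polygon per jurisdiction by grouping
# on every field except COASTAL and the geometry-derived fields.
group_fields = [c for c in gdf.columns
                if c not in ("COASTAL", "Area_SqMi", "Shape_Area", "Shape_Length", "geometry")]
merged = gdf.dissolve(by=group_fields, as_index=False)

print(len(gdf), "->", len(merged), "polygons")
```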
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/WIYLEH
Originally published by Harte-Hanks, the CiTDS dataset is now produced by Aberdeen Group, a subsidiary of Spiceworks Ziff Davis (SWZD). It is also referred to as CiTDB (Computer Intelligence Technology Database). CiTDS provides data on digital investments of businesses across the globe. It includes two types of technology datasets: (i) hardware expenditures and (ii) product installs.
Hardware expenditure data is constructed through a combination of surveys and modeling. A survey is administered to a number of companies, and the survey data is used to develop a prediction model of expenditures as a function of firm characteristics. CiTDS uses this model to predict the expenditures of non-surveyed firms and reports them in the dataset. In contrast, CiTDS does not do any imputation for product install data, which comes entirely from web scraping and surveys. A confidence score between 1 and 3 is assigned to indicate how much the source of information can be trusted: a 3 corresponds to 90-100 percent install likelihood, a 2 corresponds to 75-90 percent install likelihood, and a 1 corresponds to 65-75 percent install likelihood.
CiTDS reports technology adoption at the site level with a unique DUNS identifier. One of these sites is identified as an “enterprise,” corresponding to the firm that owns the sites. Therefore, it is possible to analyze technology adoption both at the site (establishment) and enterprise (firm) levels. CiTDS sources the site population from Dun and Bradstreet every year and drops sites that are not relevant to their clients. Due to this sample selection, there is quite a bit of variation in the number of sites from year to year; on average, 10-15 percent of sites enter and exit every year in the US data. This number is higher in the EU data. We observe similar year-to-year turnover in the products included in the dataset. Some products have become obsolete, and some new products are added every year.
There are two versions of the data: (i) version 3, which covers 2016-2020, and (ii) version 4, which covers 2020-2021. The quality of version 4 is significantly better regarding the information included about the technology products. In version 3, product categories have missing values, and they are abbreviated in a way that is sometimes difficult to interpret. Version 4 does not have any major issues. Since both versions of the data are available in 2020, CiTDS provides a crosswalk between the versions. This makes it possible to use information about products in version 4 for the products in version 3, with the caveat that there will be no crosswalk for products that exist in 2016-2019 but not in 2020. Finally, special attention should be paid to data from 2016, where the coverage is significantly different from 2017. From 2017 onwards, coverage is more consistent.
Years of Coverage:
APac: 2019-2021
Canada: 2015-2021
EMEA: 2019-2021
Europe: 2015-2018
Latin America: 2015, 2019-2021
United States: 2015-2021
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Breast Cancer Diagnostic Dataset (BCD)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devraikwar/breast-cancer-diagnostic on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The resources for this dataset can be found at https://www.openml.org/d/13 and https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
Class: no-recurrence-events, recurrence-events
age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
menopause: lt40, ge40, premeno
tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
node-caps: yes, no
deg-malig: 1, 2, 3
breast: left, right
breast-quad: left-up, left-low, right-up, right-low, central
irradiat: yes, no
Missing Attribute Values (denoted by “?”): attribute 6 has 8 instances with missing values; attribute 9 has 1.
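A minimal loading sketch based on the attribute list above, assuming the data are available as a header-less CSV; the file name and column order are illustrative:

```python
import pandas as pd

columns = ["class", "age", "menopause", "tumor_size", "inv_nodes",
           "node_caps", "deg_malig", "breast", "breast_quad", "irradiat"]

# "?" marks missing attribute values, so map it to NaN while reading.
df = pd.read_csv("breast-cancer.data", header=None, names=columns, na_values="?")

print(df.isna().sum())             # expect a few missing values in node_caps and breast_quad
print(df["class"].value_counts())  # 201 no-recurrence-events vs. 85 recurrence-events
```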
Class Distribution:
no-recurrence-events: 201 instances recurrence-events: 85 instances
Original data https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
With the attributes described above, can you predict whether a patient will have a recurrence event?
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Original Data from: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Changes made:
- four rows with missing values were removed, leaving 299 records
- the Chest Pain Type, Restecg, and Thal variables were converted to indicator variables
- the class attribute was binarised to -1 (no disease) / +1 (disease; original values 1, 2, 3)
Attributes:
Col 0: CLASS: -1 = no disease, +1 = disease
Col 1: Age (cts)
Col 2: Sex (0/1)
Col 3: indicator (0/1) for typical angina
Col 4: indicator for atypical angina
Col 5: indicator for non-anginal pain
Col 6: resting blood pressure (cts)
Col 7: serum cholesterol (cts)
Col 8: fasting blood sugar > 120 mg/dl (0/1)
Col 9: indicator for electrocardio value 1
Col 10: indicator for electrocardio value 2
Col 11: max heart rate (cts)
Col 12: exercise-induced angina (0/1)
Col 13: ST depression induced by exercise (cts)
Col 14: indicator for slope of peak exercise up
Col 15: indicator for slope of peak exercise down
Col 16: number of major vessels colored by fluoroscopy (roughly cts: 0, 1, 2, 3)
Col 17: Thal reversible defect indicator
Col 18: Thal fixed defect indicator
Col 19: Class 0-4, where 0 is disease not present and 1-4 is present
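A minimal loading sketch based on the column listing above; the file name and the short column names are illustrative, not part of the original distribution:

```python
import pandas as pd

columns = [
    "class_pm1", "age", "sex", "cp_typical", "cp_atypical", "cp_nonanginal",
    "rest_bp", "chol", "fbs_gt120", "restecg_1", "restecg_2", "max_hr",
    "exercise_angina", "st_depression", "slope_up", "slope_down",
    "n_vessels", "thal_reversible", "thal_fixed", "class_0_4",
]

df = pd.read_csv("heart_disease_modified.csv", header=None, names=columns)
print(df.shape)  # expect (299, 20) after the four rows with missing values were removed

# The binarised target in column 0 takes values -1 (no disease) and +1 (disease).
print(df["class_pm1"].value_counts())
```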
An unclean employee dataset can contain various types of errors, inconsistencies, and missing values that affect the accuracy and reliability of the data. Some common issues in unclean datasets include duplicate records, incomplete data, incorrect data types, spelling mistakes, inconsistent formatting, and outliers.
For example, there might be multiple entries for the same employee with slightly different spellings of their name or job title. Additionally, some rows may have missing data for certain columns such as bonus or exit date, which can make it difficult to analyze trends or make accurate predictions. Inconsistent formatting of data, such as using different date formats or capitalization conventions, can also cause confusion and errors when processing the data.
Furthermore, there may be outliers in the data, such as employees with extremely high or low salaries or ages, which can distort statistical analyses and lead to inaccurate conclusions.
Overall, an unclean employee dataset can pose significant challenges for data analysis and decision-making, highlighting the importance of cleaning and preparing data before analyzing it.
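A minimal cleaning sketch illustrating the kinds of fixes described above, assuming hypothetical column names such as Name, Job Title, Hire Date, Salary, and Bonus; these may not match the dataset's actual columns:

```python
import pandas as pd

df = pd.read_csv("employee_data_unclean.csv")  # hypothetical file name

# Standardize formatting before looking for duplicate records.
df["Name"] = df["Name"].str.strip().str.title()
df["Job Title"] = df["Job Title"].str.strip().str.title()
df = df.drop_duplicates(subset=["Name", "Job Title"])

# Parse dates that may arrive in mixed formats; unparseable values become NaT.
df["Hire Date"] = pd.to_datetime(df["Hire Date"], errors="coerce")

# Flag missing bonuses rather than silently dropping the rows.
print("Rows with missing Bonus:", df["Bonus"].isna().sum())

# Treat implausible salaries as outliers using the interquartile range.
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)]
print("Salary outliers:", len(outliers))
```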
Managing data is hard. So many of our partner institutions are under-resourced when it comes to preparing, archiving, sharing and interpreting HIV-related datasets. Crucial datasets often sit on the laptops of local staff in Excel sheets and Word documents, or in large locked-down data warehouses where only a few have the understanding to access it. But data is useless if it is not accessible by trusted parties for analysis.
UNAIDS has identified the following challenges faced by our local partners:
Administrative burden of data management
Equipment failure
Staff turnover
Duplication of requests for data
Secure sharing of data
Keeping data up-to-date
A new software project has been established to tackle these challenges and streamline the data management process. The AIDS Data Repository aims to improve the quality, accessibility and consistency of HIV data and HIV estimates by providing a centralised platform with tools to help countries manage and share their HIV data. The project includes the following features:
Schema-based dataset management will help local staff with the process of preparing, validating and archiving key datasets according to the requirements from UNAIDS. Schemas that are designed or approved by UNAIDS determine the design of web forms and validation tools that guide users through the process of uploading essential data.
Secure and licensed dataset sharing will give partners confidence that their data should only be used by the parties they trust for the purposes they have agreed.
Data access management tools will help organisations understand who has access to use their datasets. Access can be requested, reviewed and granted through the site, but also revoked. This can be done for individual users or for entire organisations.
Cloud-based archiving and backup of all datasets means that data will not go missing when equipment fails or staff leave. All datasets can be tagged and searched according to their metadata and will be reliably accessible forever.
DHIS2 interoperability will enable administrators to share DHIS2 data with all the features and tools provided by the AIDS Data Repository. Datasets comprising elements automatically pulled from a DHIS2 instance can be added to the site. Periodic pulling of data will ensure that these datasets do not fall out of date. Web-based tools will help administrators configure and monitor the DHIS2 configuration that will likely change over time.
Spectrum/Naomi interoperability will streamline the process of preparing and running the Spectrum and HIVE statistical models that are supported by UNAIDS. Web forms and validation tools guide users through the process of preparing the source data sets. These source data sets can then be automatically pulled into the Spectrum and Naomi statistical modelling software tools, which will return the results to the AIDS Data Repository once finished.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Toulouse Campus surveillance Dataset, named ToCaDa, contains two sets of 25 temporally synchronized videos corresponding to two scripted scenarios.
With the help of about 50 persons (actors and camera holders), these videos were shot on July 17th 2017 at 9:50 a.m. and 11:04 a.m. respectively.
Among the cameras: • 9 were located inside the main building and shot from the windows at different floors. All these cameras are focusing the car park and the path leading to the main entrance of the building with large overlapping fields of view. • 8 were located in front of the building and filmed it with large overlapping fields of view. • 8 cameras were arranged further, scattered around the university campus. Each of their views is disjoint from all the others.
About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed.
In addition to ordinary actions, some suspicious behaviors are present.
Irregularities:
Due to the wide variety of devices used during the shooting of the two scenarios, issues were encountered on some cameras, leading to videos in which a few seconds are missing. To ensure temporal synchronization between videos, black frames were added over the missing intervals of time. We list these particular videos and their missing intervals below:
F1C3: the first 66 seconds are missing. F1C5: the first 2 seconds are missing. F1C8: the first 3 seconds are missing. F1C13: the first 10 seconds are missing. F1C15: the first second is missing. F1C19: the first second is missing. F2C1: the video is accelerated and only lasts a few seconds, so we did not provide it. F2C6: missing from 4:01 to 4:12 and from 4:25 to 4:28. F2C16: missing from 5:15 to 5:26.
Some videos were recorded with mobile devices whose pixel resolution was lower than 1920 x 1080:
F1C3 and F2C3: pixel resolution is 1280 x 720. F1C4 and F2C4: pixel resolution is 640 x 480. F1C15 and F2C15: pixel resolution is 1280 x 720. F1C20 and F2C20: pixel resolution is 1440 x 1080.
More detailed information about the position of the cameras can be found on the following link: http://ubee.enseeiht.fr/dokuwiki/doku.php?id=public:tocada
Citation T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes and C. Sénac, Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views, ACM Multimedia Systems Conference, 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains datasets for the manuscript "Practical model selection for prospective virtual screening":
If you use this data in a publication, please cite:
Shengchao Liu+, Moayad Alnammi+, Spencer S. Ericksen, Andrew F. Voter, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Practical model selection for prospective virtual screening. bioRxiv 2018. doi:10.1101/337956
PubChem data were provided by the PubChem database. Follow the PubChem citation guidelines if you use the PubChem data.
The dataset has N=15384 rows and 5 columns. 15343 rows have no missing values on any column.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
fecha_unidad | NA | 0 |
volumen_totalizador_arboleda_m3 | NA | 38 |
caudal_arboleda_q_m3_hr | NA | 39 |
volumen_totalizador_aranjuez_m3 | NA | 38 |
caudal_aranjuez_q_m3_hr | NA | 41 |
This dataset was automatically described using the codebook R package (version 0.9.2).