Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This anonymized data set consists of one month (October 2018) of web tracking data from 2,148 German users. For each user, the data contain the anonymized URL of every webpage the user visited, the domain of the webpage, and the category of the domain (41 distinct categories in total). Together, these 2,148 users made 9,151,243 URL visits spanning 49,918 unique domains. For each user in the data set, we also have self-reported information (collected via a survey) about gender and age.
We acknowledge the support of Respondi AG, which provided the web tracking and survey data free of charge for research purposes, with special thanks to François Erner and Luc Kalaora at Respondi for their insights and help with data extraction.
The data set is analyzed in the following paper:
The code used to analyze the data is also available at https://github.com/gesiscss/web_tracking.
If you use data or code from this repository, please cite the paper above and the Zenodo link.
Author: Víctor Yeste. Universitat Politècnica de València.

The object of this study is the design of a cybermetric methodology that measures the success of content published in online media and, where possible, predicts the selected success variables.

Because the study integrates data from two separate areas (web publishing, and the analysis of shares and related topics on Twitter), the data were collected programmatically through both the Google Analytics Reporting API v4 and the Twitter Standard API, always respecting their usage limits.

The website analyzed is hellofriki.com. It is an online media outlet whose primary aim is to cover topics that generate a large volume of daily news, and it also publishes analysis, reports, interviews, and many other formats. All of this content falls under the sections of cinema, series, video games, literature, and comics.

This dataset has contributed to the elaboration of the PhD thesis:
Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data have been obtained from each breaking news article published online, according to the indicators described in the doctoral thesis.
All related data are stored in a database, divided into the following tables:

tesis_followers: list of user IDs of the media account's followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
- status_id: tweet ID
- created_at: date of publication
- text: content of the tweet
- path: URL extracted after processing the shortened URL in text
- post_shared: ID in WordPress of the article being shared
- retweet_count: number of retweets
- favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies, automatic Facebook shares, custom tweets without a link to an article, etc.). It has the same fields as tesis_hometimeline.

tesis_posts: data of articles published by the website and processed for some analysis.
- stats_id: analysis ID
- post_id: article ID in WordPress
- post_date: article publication date in WordPress
- post_title: title of the article
- path: URL of the article on the media website
- tags: IDs of the WordPress tags related to the article
- uniquepageviews: unique page views
- entrancerate: entrance rate
- avgtimeonpage: average time on page
- exitrate: exit rate
- pageviewspersession: page views per session
- adsense_adunitsviewed: number of ads viewed by users
- adsense_viewableimpressionpercent: ad display ratio
- adsense_ctr: ad click ratio
- adsense_ecpm: estimated ad revenue per 1000 page views

tesis_stats: data from a particular analysis, performed for each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but totals and averages are stored for faster and easier further processing.
- id: ID of the analysis
- phase: phase of the thesis in which the analysis was carried out (currently always 1)
- time: "0" if at the time of publication, "1" if 14 days later
- start_date: date and time of the measurement on the day of publication
- end_date: date and time of the measurement made 14 days later
- main_post_id: ID of the published article to be analyzed
- main_post_theme: main section of the published article to be analyzed
- superheroes_theme: "1" if about superheroes, "0" if not
- trailer_theme: "1" if trailer, "0" if not
- name: empty field, available for adding a custom name manually
- notes: empty field, available for adding notes manually, for example when a tag has been removed by hand for being too generic even though the editor assigned it
- num_articles: number of articles analyzed
- num_articles_with_traffic: number of articles analyzed with traffic (these are the ones taken into account for traffic analysis)
- num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
- num_terms: number of terms analyzed
- uniquepageviews_total: total page views
- uniquepageviews_mean: average page views
- entrancerate_mean: average entrance rate
- avgtimeonpage_mean: average duration of visits
- exitrate_mean: average exit rate
- pageviewspersession_mean: average page views per session
- total: total of ads viewed
- adsense_adunitsviewed_mean: average of ads viewed
- adsense_viewableimpressionpercent_mean: average ad display ratio
- adsense_ctr_mean: average ad click ratio
- adsense_ecpm_mean: estimated ad revenue per 1000 page views
- Total: total income
- retweet_count_mean: average retweets
- favorite_count_total: total of favorites
- favorite_count_mean: average of favorites
- terms_ini_num_tweets: total tweets on the terms on the day of publication
- terms_ini_retweet_count_total: total retweets on the terms on the day of publication
- terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
- terms_ini_favorite_count_total: total of favorites on the terms on the day of publication
- terms_ini_favorite_count_mean: average of favorites on the terms on the day of publication
- terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet about the terms on the day of publication
- terms_ini_user_num_followers_mean: average followers of users who have talked about the terms on the day of publication
- terms_ini_user_num_tweets_mean: average number of tweets published by users who talked about the terms on the day of publication
- terms_ini_user_age_mean: average account age in days of users who have talked about the terms on the day of publication
- terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets about the terms on the day of publication
- terms_end_num_tweets: total tweets on the terms 14 days after publication
- terms_end_retweet_count_total: total retweets on the terms 14 days after publication
- terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
- terms_end_favorite_count_total: total of favorites on the terms 14 days after publication
- terms_end_favorite_count_mean: average of favorites on the terms 14 days after publication
- terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet about the terms 14 days after publication
- terms_end_user_num_followers_mean: average followers of users who have talked about the terms 14 days after publication
- terms_end_user_num_tweets_mean: average number of tweets published by users who have talked about the terms 14 days after publication
- terms_end_user_age_mean: average account age in days of users who have talked about the terms 14 days after publication
- terms_end_ur_inclusion_rate: URL inclusion ratio of tweets about the terms 14 days after publication

tesis_terms: data of the terms (tags) related to the processed articles.
- stats_id: analysis ID
- time: "0" if at the time of publication, "1" if 14 days later
- term_id: term (tag) ID in WordPress
- name: name of the term
- slug: URL of the term
- num_tweets: number of tweets
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total of favorites
- favorite_count_mean: average of favorites
- followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet about the term
- user_num_followers_mean: average followers of users who were talking about the term
- user_num_tweets_mean: average number of tweets published by users who were talking about the term
- user_age_mean: average account age in days of users who were talking about the term
- url_inclusion_rate: URL inclusion ratio
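As an illustration of how the tables relate, the sketch below joins tesis_posts to tesis_hometimeline through the article ID (post_id and post_shared, as described in the field lists). It is a minimal sketch under stated assumptions: it uses an in-memory SQLite database with made-up rows and only a subset of the columns, whereas the database engine and exact schema of the published dump are not specified here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    # Only a subset of the documented columns is created for this sketch.
    conn.execute(
        "CREATE TABLE tesis_posts (post_id INTEGER, post_title TEXT, "
        "uniquepageviews INTEGER)"
    )
    conn.execute(
        "CREATE TABLE tesis_hometimeline (status_id INTEGER, "
        "post_shared INTEGER, retweet_count INTEGER, favorite_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tesis_posts VALUES (?, ?, ?)",
        [(1, "Demo article A", 120), (2, "Demo article B", 45)],
    )
    conn.executemany(
        "INSERT INTO tesis_hometimeline VALUES (?, ?, ?, ?)",
        [(10, 1, 3, 5), (11, 1, 2, 1)],
    )
    return conn


def retweets_per_article(conn):
    """Sum the retweets of all tweets sharing each article (0 if never shared)."""
    query = """
        SELECT p.post_id, p.post_title, p.uniquepageviews,
               COALESCE(SUM(t.retweet_count), 0) AS retweets
        FROM tesis_posts AS p
        LEFT JOIN tesis_hometimeline AS t ON t.post_shared = p.post_id
        GROUP BY p.post_id, p.post_title, p.uniquepageviews
        ORDER BY p.post_id
    """
    return conn.execute(query).fetchall()


rows = retweets_per_article(build_demo_db())
for row in rows:
    print(row)
```

In the real data, tesis_posts also carries a stats_id, so the same article can appear once per analysis; a query on the full table would probably need to filter or group by stats_id as well.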
This dataset collection is a compilation of related data tables sourced from the website of Tilastokeskus (Statistics Finland). The data are organized in tabular form, in rows and columns. The collection includes several tables, each representing a different year, which gives a temporal view of the data. The description provided by the data source, Tilastokeskuksen palvelurajapinta (Statistics Finland's service interface), suggests that the data are statistical in nature and may relate to regional statistics. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
http://dcat-ap.ch/vocabulary/licenses/terms_by
The data on the use of the datasets on the OGD portal BL (data.bl.ch) are collected and published by the specialist and coordination office OGD BL. Each record contains the day on which the usage was measured, together with the following fields:
- dataset_title: the title of the record.
- dataset_id: the technical ID of the dataset.
- visitors: the number of daily visitors to the record. Visitors are recorded by counting the unique IP addresses that accessed the record on the day of the survey. The IP address represents the network address of the device from which the portal was accessed.
- interactions: all interactions with any record on data.bl.ch. A visitor can trigger multiple interactions. Interactions include clicks on the website (searching datasets, filters, etc.) as well as API calls (downloading a dataset as a JSON file, etc.).

Remarks:
- Only calls to publicly available datasets are shown.
- IP addresses and interactions of users with a login of the Canton of Basel-Landschaft, in particular employees of the specialist and coordination office OGD, are removed from the dataset before publication and are therefore not shown.
- Calls from actors that are clearly identifiable as bots by the user agent header are also not shown.
- Combinations of dataset and date for which no use occurred (Visitors == 0 & Interactions == 0) are not shown.
- Due to synchronization problems, data for individual days may be missing.
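The counting rules above can be sketched roughly as follows. This is only an illustration under assumptions: the raw access log and its keys (date, dataset_id, ip, is_bot, is_canton_login) are hypothetical, since the portal's internal log format is not described here.

```python
from collections import defaultdict


def daily_usage(access_log):
    """Aggregate raw access events into visitors and interactions.

    Visitors are unique IP addresses per (date, dataset); every remaining
    event counts as one interaction. Bot traffic and logged-in canton
    users are dropped before counting.
    """
    ips = defaultdict(set)
    interactions = defaultdict(int)
    for event in access_log:
        if event["is_bot"] or event["is_canton_login"]:
            continue
        key = (event["date"], event["dataset_id"])
        ips[key].add(event["ip"])
        interactions[key] += 1
    # Combinations without any event never get a key, so rows with
    # zero visitors and zero interactions are not produced.
    return {
        key: {"visitors": len(ips[key]), "interactions": interactions[key]}
        for key in sorted(interactions)
    }


# Hypothetical events, for illustration only.
log = [
    {"date": "2024-01-01", "dataset_id": "100", "ip": "10.0.0.1",
     "is_bot": False, "is_canton_login": False},
    {"date": "2024-01-01", "dataset_id": "100", "ip": "10.0.0.1",
     "is_bot": False, "is_canton_login": False},
    {"date": "2024-01-01", "dataset_id": "100", "ip": "10.0.0.2",
     "is_bot": False, "is_canton_login": False},
    {"date": "2024-01-01", "dataset_id": "200", "ip": "10.0.0.3",
     "is_bot": True, "is_canton_login": False},
]
usage = daily_usage(log)
print(usage)
```

Here one visitor triggers two interactions on the same dataset, and the bot-only dataset does not appear in the result at all.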
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘K-Pop Hits Through The Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sberj127/kpop-hits-through-the-years on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The datasets contain the top songs from the era or year given in the name of each dataset. Note that only the KPopHits90s dataset represents an era (1989-2001). Although there is a lack of easily available and reliable sources showing the actual K-Pop hits per year during the 90s, this era was still included because it is when the first generation of K-Pop stars appeared. Each of the other datasets represents a specific year after the 90s.
A song is considered to be a K-Pop hit during that era or year if it is included in the annual series of K-Pop Hits playlists, which is created officially by Apple Music. Note that for the dataset that represents the 90s, the playlist 90s K-Pop Essentials was used as the reference.
As someone who has a particular curiosity about the field of data science and a genuine love for the musicality of the K-Pop scene, I created this data set to make something out of my strong interest in these two separate subjects.
I would like to express my sincere gratitude to Apple Music for creating the annual K-Pop playlists, Spotify for making their API very accessible, Spotipy for making it easier to get the desired data from the Spotify Web API, Tune My Music for automating the process of transferring one's library into another service's library and, of course, all those involved in the making of these songs and artists included in these datasets for creating such high quality music and concepts digestible even for the general public.
--- Original source retains full ownership of the source dataset ---
The dataset collection in question is a compilation of statistical area data. It includes one or more tables of interconnected data, structured in the form of rows and columns. The data in the collection is sourced from the 'Statistics Centre' (Tilastokeskus), a recognized institution in Finland. The description provided by the data source, translated to English, is 'Statistical Centre's Service Interface (WFS)'. This suggests that the dataset collection is likely a representation of statistical data provided through a web feature service by the Statistics Centre. The dataset collection might include various statistical area details, possibly related to the greater area of 1000 square kilometers, as suggested by the year 2015, which may indicate the time period the data covers. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
This dataset includes aggregated weekly data on the percent of emergency department visits and the percent of hospital inpatient admissions due to influenza-like illness (ILI), COVID-19, influenza, RSV, and acute respiratory illness. The Illinois Department of Public Health (IDPH) collects data for Emergency Department visits to all 185 acute care hospitals in Illinois. The data are submitted from IDPH to the CDC’s BioSense Platform for access and analysis by health departments via the ESSENCE system. The CDC National Syndromic Surveillance Program (NSSP) utilizes diagnostic codes and clinical terms to create definitions for diagnosed COVID-19, influenza, RSV, and acute respiratory illness. For more information on diagnostic codes and clinical terms used, visit: https://www.cdc.gov/nssp/php/onboarding-resources/companion-guide-ed-data-respiratory-illness.html The data is characterized by selected demographic groups including age group and race/ethnicity. The dataset also includes percent of weekly outpatient visits due to ILI as reported by several outpatient clinics throughout Chicago that participate in CDC’s Influenza-like Illness Surveillance Network (ILINet). For more information on ESSENCE, see https://www.dph.illinois.gov/data-statistics/syndromic-surveillance For more information on ILINet, see https://www.cdc.gov/fluview/overview/index.html#cdc_generic_section_3-outpatient-illness-surveillance All data are provisional and subject to change. Information is updated as additional details are received. At any given time, this dataset reflects data currently known to CDPH. Numbers in this dataset may differ from other public sources.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. 
The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. Out of a commitment to promoting public knowledge of Michigan local governance, the Center for Local, State, and Urban Policy is releasing public use datasets. In order to protect respondent confidentiality, CLOSUP has divided the data collected in each wave of the survey into separate datasets focused on different topics that were covered in the survey. Each dataset contains only variables relevant to that subject, and the datasets cannot be linked together. Variables have also been omitted or recoded to further protect respondent confidentiality. For researchers looking for a more extensive release of the MPPS data, restricted datasets are available through openICPSR's Virtual Data Enclave. Please note: additional waves of MPPS public use datasets are being prepared, and will be available as part of this project as soon as they are completed. For information on accessing MPPS public use and restricted datasets, please visit the MPPS data access page: http://closup.umich.edu/mpps-download-datasets
Individual visits to El Pueblo museums, per month. *The Museum of Social Justice is an independently operated museum, and reopened to the public May 2021. All El Pueblo-operated museums partially reopened June 10, 2021.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments.
As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights.
While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority.
For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are either high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:
- Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
- Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
- Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reykjanes, Hengill.

Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset, at the time of assembling the dataset. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.
Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that it does not require labels on the data, and it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, so applying a supervised learning technique would be challenging. To generate weights from the PCA, the PCA loading values were analyzed. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
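To make the idea of deriving weights from PCA loadings concrete, here is a minimal pure-Python sketch. It rests on several assumptions that are not taken from the DEEPEN methodology: the ordinal encoding low=1, medium=2, high=3, the use of only the first principal component, the normalization of absolute loadings to sum to 1, and the made-up site rows. It also assumes rows without gaps, unlike the real dataset.

```python
import math

# Assumed ordinal encoding of the semi-quantitative anomaly levels.
LEVELS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=500):
    """Leading eigenvector of `cov`, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def loading_weights(rows):
    """Absolute loadings of the first component, scaled to sum to 1."""
    pc1 = first_component(covariance_matrix(rows))
    total = sum(abs(x) for x in pc1)
    return [abs(x) / total for x in pc1]


# Hypothetical sites (rows) and exploration datasets (columns).
sites = [
    ["high", "high", "medium"],
    ["low", "low", "medium"],
    ["high", "medium", "medium"],
    ["low", "low", "medium"],
]
encoded = [[LEVELS[level] for level in site] for site in sites]
weights = loading_weights(encoded)
print(weights)
```

In this toy input the third column never varies, so it contributes nothing to the variance and receives a weight of zero, while the column with the largest spread receives the largest weight. A fuller analysis would likely combine several components, for example weighted by their explained variance.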
https://www.icpsr.umich.edu/web/ICPSR/studies/32721/terms
The Study of Women's Health Across the Nation (SWAN) is a multi-site, longitudinal, epidemiologic study designed to examine the health of women during their middle years. The study examines the physical, biological, psychological, and social changes during this transitional period. The goal of SWAN's research is to help scientists, health care providers, and women learn how mid-life experiences affect health and quality of life during aging. Data were collected about doctor visits, medical conditions, medications, treatments, medical procedures, relationships, smoking, and menopause-related information such as age at pre-, peri-, and post-menopause, self-attitudes, feelings, and common physical problems associated with menopause. The study began in 1994. Between 2005 and 2007, 2,255 of the 3,302 women who joined SWAN were seen for their ninth follow-up visit. The research centers are located in the following communities: Ypsilanti and Inkster, MI (University of Michigan); Boston, MA (Massachusetts General Hospital); Chicago, IL (Rush Presbyterian-St. Luke's Medical Center); Alameda and Contra Costa County, CA (University of California-Davis and Kaiser Permanente); Los Angeles, CA (University of California-Los Angeles); Hackensack, NJ (Hackensack University Medical Center); and Pittsburgh, PA (University of Pittsburgh). SWAN participants represent five racial/ethnic groups and a variety of backgrounds and cultures. Although the New Jersey site was still part of the study, data were not collected from this site for the ninth visit. Demographic and background information includes age, language of interview, marital status, household composition, and employment.
The dataset contains contact and description information for local supply chain organizations, offshore wind developers, and original equipment manufacturers that provide goods and services to support New York State’s offshore wind industry. To request placement in this database, or to update your company’s information, please visit NYSERDA’s Supply Chain Database webpage at https://www.nyserda.ny.gov/All-Programs/Offshore-Wind/Focus-Areas/Supply-Chain-Economic-Development/Supply-Chain-Database to submit a request form. How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov. The New York State Energy Research and Development Authority (NYSERDA) offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit https://nyserda.ny.gov or follow us on Twitter, Facebook, YouTube, or Instagram.
This dataset is a complete inventory of all assets on this site and any assets sourced from other sites, if applicable. Use this dataset to track the performance of data publishing, conduct metadata maintenance, or present an overview of what kinds of data exist on the site.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Asset inventory data for a variety of structures and infrastructure relating to water systems or drainage in urban areas. The features in this dataset are measured by length and represent linear features such as pipe networks or open drains. The information is extracted from the asset inventory database on a daily basis. Items identified have been geolocated over a long period of time and through various methods, including information provided by third parties. In general, asset locations are obtained from as-built diagrams and as such may not be validated in all circumstances. The asset inventory is frequently updated, and modifications can be made to the asset data structure (asset hierarchy) without prior notification. Due to the wide range of source information, all asset locations should be verified through the Asset Information Officers and/or site visits. This is an incomplete dataset; other information is held and maintained independently.

The primary purpose of this inventory is asset valuation. The inventory is also used in forward works and capital works planning. Information on Water Supply assets for service requests is displayed on the 3 Waters map. The Water Supply network is an integral part of the land use and consents process; however, site visits should be done to validate the status, position, and condition of assets.

Waikato OneView does not make any representation or give any warranty as to the accuracy or exhaustiveness of the data released for public download. Locations and dimensions of assets depicted in the data may not be accurate due to circumstances not notified to Council. While you are free to crop, export, and re-purpose the data, we ask that you attribute Waikato OneView and clearly state that your work is a derivative and not the authoritative data source.
This dataset contains counts and rates (per 10,000 residents) of asthma emergency department (ED) visits among Californians. The table “Asthma Emergency Department Visit Rates by County” contains statewide and county-level data stratified by age group (all ages, 0-17, 18+, 0-4, 5-17, 18-64, 65+) and race/ethnicity (white, black, Hispanic, Asian/Pacific Islander, American Indian/Alaskan Native). The table “Asthma Emergency Department Visit Rates by ZIP Code” contains zip-code level data stratified by age group (all ages, 0-17, 18+). The data are derived from the Department of Health Care Access and Information emergency department database. These data include emergency department visits from all licensed hospitals in California. These data are based only on primary discharge diagnosis codes. On October 1, 2015, diagnostic coding for asthma transitioned from ICD9-CM (493) to ICD10-CM (J45). Because of this change, CDPH and CDC do not recommend comparing data from 2015 (or earlier) to 2016 (or later). NOTE: Rates are calculated from the total number of asthma emergency department visits (not the unique number of individuals).
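The rate definition above (visits per 10,000 residents, counted per visit rather than per person) can be illustrated with a short sketch. It shows a crude rate only; the function name, the population figure, and the visit count are hypothetical, and any age adjustment that the published rates may include is not reproduced here.

```python
def ed_visit_rate_per_10k(visit_count, resident_population):
    """Crude rate of ED visits per 10,000 residents.

    `visit_count` counts visits, not unique individuals, so a person
    with three visits contributes three to the numerator.
    """
    if resident_population <= 0:
        raise ValueError("resident_population must be positive")
    return visit_count * 10_000 / resident_population


# Hypothetical county: 250 asthma ED visits among 100,000 residents.
rate = ed_visit_rate_per_10k(250, 100_000)
print(rate)
```

Because of the ICD-9-CM to ICD-10-CM transition described above, rates computed this way for 2015 or earlier should not be compared with those for 2016 or later.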
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Coursera Course Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/siddharthm1698/coursera-course-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This is a dataset I generated during a hackathon for project purposes. I scraped the data from the official Coursera website. Our project aims to help any new learner find the right course by answering just a few questions. It is an intelligent course recommendation system, so we had to scrape data from a few educational websites. This is the data scraped from the Coursera website. For the project, visit: https://github.com/Siddharth1698/Coursu . Please do show your support by following us. I have just started to learn data science and hope this dataset will be helpful to someone for his/her personal purposes. The scraping code is here: https://github.com/Siddharth1698/Coursera-Course-Dataset Article about the dataset generation: https://medium.com/analytics-vidhya/web-scraping-and-coursera-8db6af45d83f
This dataset contains 6 main columns and data on 890 courses. The detailed description:
1. course_title: the course title.
2. course_organization: the organization conducting the course.
3. course_Certificate_type: the type of certification available for the course.
4. course_rating: the rating associated with the course.
5. course_difficulty: the difficulty level of the course.
6. course_students_enrolled: the number of students enrolled in the course.
This is just one of my first scraped datasets. Follow my GitHub for more: https://github.com/Siddharth1698
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Asset inventory data for a variety of structures and infrastructure relating to storm water systems or drainage in urban areas. The features in this dataset are measured by length and represent linear features such as pipe networks or open drains. The information is extracted from the asset inventory database (Asset-Finda) on a daily basis. Items identified have been geolocated over a long period of time and through various methods, including information provided by third parties. In general, asset locations are obtained from as-built diagrams and as such may not be validated in all circumstances. The asset inventory is frequently updated, and modifications can be made to the asset data structure (asset hierarchy) without prior notification. Due to the wide range of source information, all asset locations should be verified through the Asset Information Officers and/or site visits. This is an incomplete dataset; other information is held and maintained independently. Waikato District Alliance holds storm water asset information for all assets under the road pavement. Waikato Regional Council holds further asset information in all the rural areas. The primary purpose of this inventory is asset valuation. The inventory is also used in forward works and capital works planning. Information on storm water assets for service requests is displayed on the 3 Waters map. The storm water network is an integral part of the land use and consents process; however, site visits should be done to validate the status, position, and condition of assets.
https://creativecommons.org/publicdomain/zero/1.0/
YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”.
Note that this dataset is a structurally improved version of this dataset.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the IN, US, GB, DE, CA, FR, RU, BR, MX, KR, and JP regions (India, USA, Great Britain, Germany, Canada, France, Russia, Brazil, Mexico, South Korea, and Japan, respectively), with up to 200 listed trending videos per day.
Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the 11 regions in the dataset.
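The category lookup described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's own code: it assumes the per-region JSON follows the YouTube Data API videoCategories response shape (an "items" list whose entries carry an "id" and a "snippet.title"), and the sample rows, titles, and the added category_name field are hypothetical stand-ins for a real region file.

```python
import csv
import io
import json

# Assumption: the per-region JSON has the YouTube Data API videoCategories
# shape, i.e. {"items": [{"id": "...", "snippet": {"title": "..."}}]}.
CATEGORY_JSON = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}
"""

# Hypothetical sample rows; a real region file has many more columns.
CSV_TEXT = """title,category_id
Some song,10
Some sketch,24
Unlisted category,99
"""


def load_category_names(json_text):
    """Build a {category_id: category_title} mapping from the JSON text."""
    data = json.loads(json_text)
    return {int(item["id"]): item["snippet"]["title"] for item in data["items"]}


def attach_category_names(csv_text, names):
    """Return the CSV rows as dicts with an added 'category_name' field."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # IDs missing from this region's JSON are labeled "Unknown".
        row["category_name"] = names.get(int(row["category_id"]), "Unknown")
        rows.append(row)
    return rows


names = load_category_names(CATEGORY_JSON)
labeled = attach_category_names(CSV_TEXT, names)
for row in labeled:
    print(row["title"], "->", row["category_name"])
```

Because category IDs vary between regions, each region's data file should be paired with that same region's JSON rather than with a single shared mapping.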
For more information on specific columns in the dataset refer to the column metadata.
This dataset was collected using the YouTube API. This dataset is the updated version of Trending YouTube Video Statistics.
Possible uses for this dataset could include:
- Sentiment analysis in a variety of forms.
- Categorizing YouTube videos based on their comments and statistics.
- Training ML algorithms like RNNs to generate their own YouTube comments.
- Analyzing what factors affect how popular a YouTube video will be.
- Statistical analysis over time.
For further inspiration, see the kernels on this dataset!
A Grand Site operation is the approach proposed by the State to local and regional authorities to address the difficulties of welcoming visitors to, and maintaining, classified sites that are very well known and heavily visited. It makes it possible to define and implement a concerted project for the restoration, preservation and development of the territory.
It applies to a site classified under Articles L. 341-1 to L. 341-22 of the Environmental Code (Law of 2 May 1930) that faces a problem of tourist use or maintenance requiring decisions on the management of the site. Its purpose is to accompany the territory towards the eventual award of the Grand Site de France label. This label, "Grand Site de France", is owned by the State and has had legal standing since 2010 (Article L. 341-15-1 of the Environmental Code).