Daily utilization metrics for data.lacity.org and geohub.lacity.org. Updated monthly
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore our detailed website traffic dataset featuring key metrics like page views, session duration, bounce rate, traffic source, and conversion rates.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The dataset is a comprehensive collection of 15,150 classic hits from 3,083 artists, spanning a century of music history from 1923 to 2023. This diverse dataset is divided into 19 distinct genres, showcasing the evolution of popular music across different eras and styles. Each track in the dataset is enriched with Spotify audio features, offering detailed insights into the acoustic properties, rhythm, tempo, and other musical characteristics. This makes the dataset not only a valuable resource for exploring trends and comparing genres but also for analyzing the sonic qualities that define classic hits across different time periods and genres.
Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.
This dataset compiles the tracks from Spotify's official "Top Tracks of 2023" playlist, showcasing the most popular and influential music of the year according to Spotify's streaming data. It represents a wide array of genres, artists, and musical styles that defined the musical landscape of 2023. Each track in the dataset is described by a variety of audio features, popularity measures, and metadata. This dataset serves as a resource for music enthusiasts, data analysts, and researchers aiming to explore music trends or develop music recommendation systems based on empirical data.
The data was obtained directly from the Spotify Web API, specifically from the "Top Tracks of 2023" official playlist curated by Spotify. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.
To process and structure the data, I developed Python scripts using data science libraries such as pandas for data manipulation and spotipy for API interactions, specifically for Spotify data retrieval.
I encourage users who discover new insights, propose dataset enhancements, or craft analytics that illuminate aspects of the dataset's focus to share their findings with the community. - Kaggle Notebooks: To facilitate sharing and collaboration, users are encouraged to create and share their analyses through Kaggle notebooks. For ease of use, start your notebook by clicking "New Notebook" atop this dataset’s page on K...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
The following table describes each dataset file:
File name | Detection problem | Citation of original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23 |
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related Features. In ICISSP, pages 407–414, 2016. |
vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related Features. In ICISSP, pages 407–414, 2016. |
vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDCC Traffic Congestion Saturation Flow Data for January to June 2023. Traffic volumes, traffic saturation, and congestion data for sites across South Dublin County. Used by traffic management to control stage timings on junctions. It is recommended that this dataset is read in conjunction with the ‘Traffic Data Site Names SDCC’ dataset.
A detailed description of each column heading is given below:
scn: Site serial number.
region: A group of nodes that are operated under SCOOT control at the same common cycle time. Normally these will be nodes between which co-ordination is desirable. Some of the nodes may be double cycling at half of the region cycle time.
system: SCOOT STC UTC (UTC-MX).
locn: Location.
ssite: Site number.
sday: Days of the week, Monday to Sunday. Abbreviations: MO, TU, WE, TH, FR, SA, SU.
date: Reflects the correct actual date on which the data was collected.
start_time: NOTE - Please ignore the date displayed in this column. The actual data collection date is correctly displayed in the 'date' column. The date displayed here is the date on which the report was run and extracted from the system, but the column correctly reflects the start time of the 15 minute intervals.
end_time: End time of the 15 minute intervals.
flow: A representation of demand (flow) for each link built up over several minutes by the SCOOT model. SCOOT has two profiles:
(1) Short – Raw data representing the actual values over the previous few minutes
(2) Long – A smoothed average of values over a longer period
SCOOT will choose to use the appropriate profile depending on a number of factors.
flow_pc: Same as above ref PC SCOOT.
cong: Congestion is directly measured from the detector. If the detector is placed beyond the normal end of queue in the street it is rarely covered by stationary traffic, except of course when congestion occurs. If any detector shows standing traffic for the whole of an interval this is recorded. The number of intervals of congestion in any cycle is also recorded. The percentage congestion is calculated from:
(No. of congested intervals x 4 x 100) / cycle time in seconds
This percentage of congestion is available to view and, more importantly, for the optimisers to take into account.
cong_pc: Same as above ref PC SCOOT.
dsat: The ratio of the demand flow to the maximum possible discharge flow, i.e. it is the ratio of the demand to the discharge rate (Saturation Occupancy) multiplied by the duration of the effective green time. The Split optimiser will try to minimise the maximum degree of saturation on links approaching the node.
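The percentage congestion calculation in the cong description can be illustrated with a small sketch. It reads the formula as the number of congested intervals times 4 times 100, divided by the cycle time in seconds, and assumes the factor of 4 is the interval length in seconds. The function and parameter names are hypothetical and are not part of the dataset or of SCOOT.

```python
def congestion_percent(congested_intervals, cycle_time_s, interval_s=4):
    """Percentage congestion for one cycle.

    Assumption: the factor 4 in the description is the interval length
    in seconds, so intervals * 4 is the congested time within the cycle.
    """
    if cycle_time_s <= 0:
        raise ValueError("cycle_time_s must be positive")
    return congested_intervals * interval_s * 100 / cycle_time_s


# Example: 6 congested intervals in a 120 second cycle.
example = congestion_percent(6, 120)
print(example)
```

Under this reading, a cycle in which every interval is congested gives 100 percent.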
https://creativecommons.org/publicdomain/zero/1.0/
The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.
The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:
Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Fork this kernel to get started.
Banner Photo by Edho Pratama from Unsplash.
What is the total number of transactions generated per device browser in July 2017?
The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?
What was the average number of product pageviews for users who made a purchase in July 2017?
What was the average number of product pageviews for users who did not make a purchase in July 2017?
What was the average total transactions per user that made a purchase in July 2017?
What is the average amount of money spent per session in July 2017?
What is the sequence of pages viewed?
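The "real bounce rate" question above can be sketched in a few lines. This is a minimal illustration that assumes each session has already been reduced to a (traffic source, pageview count) pair; the function name and the input shape are hypothetical and do not reflect the actual Google Analytics 360 export schema.

```python
from collections import defaultdict


def bounce_rate_by_source(sessions):
    """Return {source: percentage of visits with exactly one pageview}.

    `sessions` is an iterable of (traffic_source, pageviews) pairs,
    which is an assumed, simplified shape for illustration only.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for source, pageviews in sessions:
        totals[source] += 1
        if pageviews == 1:
            bounces[source] += 1
    return {source: 100.0 * bounces[source] / totals[source] for source in totals}


sample = [("google", 1), ("google", 3), ("direct", 1)]
rates = bounce_rate_by_source(sample)
print(rates)
```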
This dataset is associated with the following publication: Nyffeler, J., D. Haggard, C. Willis, W. Setzer, R. Judson, K. Paul-Friedman, L. Everett, and J. Harrill. Comparison of Approaches for Determining Bioactivity Hits from High-Dimensional Profiling Data. SLAS Discovery. SAGE Publications, THOUSAND OAKS, CA, USA, 26(2): 292-308, (2021).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145063 daily time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell files extracted from browsing simulations using Tor Browser. The simulations cover both desktop and mobile webpages. The data were collected using the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector). All the necessary configuration to perform the simulation is detailed in the tool repository. The webpage URLs were selected from the first 100 websites listed at https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL was visited 90 times in each of the desktop and mobile browsing modes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Difference uses Google Analytics as the Baseline. Results based on Paired t-Test for Hypotheses Supported.
Our dataset provides detailed and precise insights into the business, commercial, and industrial aspects of any given area in the USA (including Point of Interest (POI) data and foot traffic). The dataset is divided into 150x150 sqm areas (geohash 7) and has over 50 variables.
- Use it for different applications: Our combined dataset, which includes POI and foot traffic data, can be employed for various purposes. Different data teams use it to guide retailers and FMCG brands in site selection, fuel marketing intelligence, analyze trade areas, and assess company risk. Our dataset has also proven to be useful for real estate investment.
- Get reliable data: Our datasets have been processed, enriched, and tested so your data team can use them more quickly and accurately.
- Ideal for training ML models: The high quality of our geographic information layers results from more than seven years of work dedicated to the deep understanding and modeling of geospatial Big Data. Among the features that distinguish this dataset is the use of anonymized and user-compliant mobile device GPS location, enriched with other alternative and public data.
- Easy to use: Our dataset is user-friendly and can be easily integrated into your current models. Also, we can deliver your data in different formats, like .csv, according to your analysis requirements.
- Get personalized guidance: In addition to providing reliable datasets, we advise your analysts on their correct implementation. Our data scientists can guide your internal team on the optimal algorithms and models to get the most out of the information we provide (without compromising the security of your internal data).
Answer questions like:
- What places does my target user visit in a particular area?
- Which are the best areas to place a new POS?
- What is the average yearly income of users in a particular area?
- What is the influx of visits that my competition receives?
- What is the volume of traffic surrounding my current POS?
This dataset is useful for getting insights from industries like:
- Retail & FMCG
- Banking, Finance, and Investment
- Car Dealerships
- Real Estate
- Convenience Stores
- Pharma and medical laboratories
- Restaurant chains and franchises
- Clothing chains and franchises
Our dataset includes more than 50 variables, such as:
- Number of pedestrians seen in the area.
- Number of vehicles seen in the area.
- Average speed of movement of the vehicles seen in the area.
- Points of Interest (POIs) (in number and type) seen in the area (supermarkets, pharmacies, recreational locations, restaurants, offices, hotels, parking lots, wholesalers, financial services, pet services, shopping malls, among others).
- Average yearly income range (anonymized and aggregated) of the devices seen in the area.
Notes to better understand this dataset:
- POI confidence means the average confidence of POIs in the area. In this case, POIs are any kind of location, such as a restaurant, a hotel, or a library.
- Category confidences, for example "food_drinks_tobacco_retail_confidence", indicate how confident we are in the existence of food/drink/tobacco retail locations in the area.
- We added predictions for The Home Depot and Lowe's Home Improvement stores in the dataset sample. These predictions were the result of a machine-learning model that was trained with the data. Knowing where the current stores are, we can find the most similar areas for new stores to open.
How efficient is a Geohash?
Geohash is a faster, cost-effective geofencing option that reduces input data load and provides actionable information. Its benefits include faster querying, reduced cost, minimal configuration, and ease of use. Geohash ranges from 1 to 12 characters. The dataset can be split into variable-size geohashes, with the default being geohash7 (150m x 150m).
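To make the geohash cell idea concrete, the sketch below encodes a latitude/longitude pair with the standard public geohash algorithm (interleaved longitude and latitude bits, base-32 alphabet). It is an illustrative implementation only, not the provider's code; the function name is hypothetical, and the default precision of 7 simply mirrors the default cell size mentioned above.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a WGS84 point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


cell = geohash_encode(57.64911, 10.40744)
print(cell)
```

Shorter prefixes of the same string name larger enclosing cells, which is what allows records to be regrouped at coarser precisions.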
The Traffic Camera dataset contains the location and number for every traffic camera in the City of Toronto. These datasets will be updated within 2 minutes when cameras are added, changed, or removed. The camera list files can be found at: https://opendata.toronto.ca/transportation/tmc/rescucameraimages/Data/
tmcearthcameras.csv - camera list in CSV format.
tmcearthcameras.json - JSON formatted list.
tmcearthcamerassn.json - JSON formatted file containing the timestamp of the list files.
tmcearthcameras.xml - XML formatted list.
TMCEarthCameras.xsd - XML schema document.
The dataset includes the number, name, WGS84 information (latitude, longitude), comparison directions (1 - Looking North, 2 - Looking East, 3 - Looking South and 4 - Looking West), and camera group. The camera images associated with the dataset can be found at: https://opendata.toronto.ca/transportation/tmc/rescucameraimages/CameraImages. The comparison images can be found at: https://opendata.toronto.ca/transportation/tmc/rescucameraimages/ComparisonImages. The camera image file name is created as follows: loc####.jpg, where #### is the camera number (e.g. loc1234.jpg). The camera comparison image file names are created as follows: loc####D.jpg, where #### is the camera number and D is the direction (e.g. loc1234e.jpg and loc1234w.jpg). The camera images are displayed on the City's website at http://www.toronto.ca/rescu/index.htm or http://www.toronto.ca/rescu/list.htm
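The file naming scheme described above can be sketched as two small helpers. The sketch makes assumptions that the description does not state outright: camera numbers are zero-padded to four digits (suggested by the #### placeholder), and the direction codes 1 to 4 map to the lowercase letters n, e, s and w (suggested by the loc1234e.jpg and loc1234w.jpg examples). The function names are hypothetical.

```python
# Assumed mapping from the dataset's direction codes to file-name letters.
DIRECTION_LETTERS = {1: "n", 2: "e", 3: "s", 4: "w"}


def camera_image_name(camera_number):
    """File name of a camera image, e.g. loc1234.jpg (assumes 4-digit padding)."""
    return f"loc{camera_number:04d}.jpg"


def comparison_image_name(camera_number, direction_code):
    """File name of a comparison image, e.g. loc1234e.jpg for direction 2 (East)."""
    letter = DIRECTION_LETTERS[direction_code]
    return f"loc{camera_number:04d}{letter}.jpg"


print(camera_image_name(1234))
print(comparison_image_name(1234, 2))
```

A full URL would be formed by appending the file name to the CameraImages or ComparisonImages address given in the description.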
A. SUMMARY This San Francisco International Airport (SFO) aircraft landing dataset contains data about aircraft landings at SFO, with monthly landing counts and landed weight by airline, region and aircraft model and type. B. HOW THE DATASET IS CREATED Data is self-reported by airlines and is only available at a monthly level. C. UPDATE PROCESS Data is available starting in July 1999 and will be updated monthly. D. HOW TO USE THIS DATASET Airport data is seasonal in nature; therefore, any comparative analyses should be done on a period-over-period basis (e.g. January 2010 vs. January 2009) as opposed to period-to-period (e.g. January 2010 vs. February 2010). It is also important to note that fact and attribute field relationships are not always 1-to-1. For example, Cargo Statistics belonging to United Airlines will appear in multiple attribute fields and are additive, which provides flexibility for the user to derive categorical Cargo Statistics as desired. E. RELATED DATASETS A summary of monthly comparative air-traffic statistics is also available on SFO’s internet site at https://www.flysfo.com/about/media/facts-statistics/air-traffic-statistics
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘K-Pop Hits Through The Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sberj127/kpop-hits-through-the-years on 12 November 2021.
--- Dataset description provided by original source is as follows ---
The datasets contain the top songs from the era or year given in the name of each dataset. Note that only the KPopHits90s dataset represents an era (1989-2001). Although there is a lack of easily available and reliable sources to show the actual K-Pop hits per year during the 90s, this era was still included because it was the period when the first generation of K-Pop stars appeared. Each of the other datasets represents a specific year after the 90s.
A song is considered to be a K-Pop hit during that era or year if it is included in the annual series of K-Pop Hits playlists, which is created officially by Apple Music. Note that for the dataset that represents the 90s, the playlist 90s K-Pop Essentials was used as the reference.
As someone with a particular curiosity about the field of data science and a genuine love for the musicality of the K-Pop scene, I created this dataset to make something out of my strong interest in these separate subjects.
I would like to express my sincere gratitude to Apple Music for creating the annual K-Pop playlists, Spotify for making their API very accessible, Spotipy for making it easier to get the desired data from the Spotify Web API, Tune My Music for automating the process of transferring one's library into another service's library and, of course, all those involved in the making of these songs and artists included in these datasets for creating such high quality music and concepts digestible even for the general public.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDCC Traffic Congestion Saturation Flow Data for January to June 2020. Traffic volumes, traffic saturation, and congestion data for sites across South Dublin County. Used by traffic management to control stage timings on junctions. It is recommended that this dataset is read in conjunction with the ‘Traffic Data Site Names SDCC’ dataset.
A detailed description of each column heading is given below:
scn: Site serial number.
region: A group of nodes that are operated under SCOOT control at the same common cycle time. Normally these will be nodes between which co-ordination is desirable. Some of the nodes may be double cycling at half of the region cycle time.
system: SCOOT STC UTC (UTC-MX).
locn: Location.
ssite: Site number.
sday: Days of the week, Monday to Sunday. Abbreviations: MO, TU, WE, TH, FR, SA, SU.
date: Reflects the correct actual date on which the data was collected.
start_time: NOTE - Please ignore the date displayed in this column. The actual data collection date is correctly displayed in the 'date' column. The date displayed here is the date on which the report was run and extracted from the system, but the column correctly reflects the start time of the 15 minute intervals.
end_time: End time of the 15 minute intervals.
flow: A representation of demand (flow) for each link built up over several minutes by the SCOOT model. SCOOT has two profiles:
(1) Short – Raw data representing the actual values over the previous few minutes
(2) Long – A smoothed average of values over a longer period
SCOOT will choose to use the appropriate profile depending on a number of factors.
flow_pc: Same as above ref PC SCOOT.
cong: Congestion is directly measured from the detector. If the detector is placed beyond the normal end of queue in the street it is rarely covered by stationary traffic, except of course when congestion occurs. If any detector shows standing traffic for the whole of an interval this is recorded. The number of intervals of congestion in any cycle is also recorded. The percentage congestion is calculated from:
(No. of congested intervals x 4 x 100) / cycle time in seconds
This percentage of congestion is available to view and, more importantly, for the optimisers to take into account.
cong_pc: Same as above ref PC SCOOT.
dsat: The ratio of the demand flow to the maximum possible discharge flow, i.e. it is the ratio of the demand to the discharge rate (Saturation Occupancy) multiplied by the duration of the effective green time. The Split optimiser will try to minimise the maximum degree of saturation on links approaching the node.
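The note on the start_time column (ignore its date part and take the date from the 'date' column) can be applied as in the sketch below. The string formats used here are assumptions chosen for illustration, since the description does not specify how either column is formatted, and the function name is hypothetical.

```python
from datetime import datetime


def interval_start(date_str, start_time_str,
                   date_fmt="%Y-%m-%d", stamp_fmt="%Y-%m-%d %H:%M:%S"):
    """Combine the trusted 'date' value with the clock time of 'start_time'.

    Both format strings are assumed; adjust them to the actual file.
    The date embedded in start_time is the report run date and is discarded.
    """
    day = datetime.strptime(date_str, date_fmt).date()
    clock = datetime.strptime(start_time_str, stamp_fmt).time()
    return datetime.combine(day, clock)


# The report was run in July, but the data was collected on 15 January.
start = interval_start("2020-01-15", "2020-07-02 08:15:00")
print(start)
```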
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union".
Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content?
To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media outlets using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We did not use local metrics by country, since the results obtained with these two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic.
In all cases, cybermedia whose property corresponds to a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; in some cases they correspond to classic information companies (both newspapers and televisions) while in others they refer to digital natives, without this circumstance affecting the nature of the research proposed.
Next, we examined the web traffic data of these cybermedia. The period selected covers October, November and December 2021 and January, February and March 2022. We believe that this six-month stretch smooths out possible one-off variations in any single month, reinforcing the precision of the data obtained.
To secure this data, we have used the SimilarWeb tool, currently the most precise tool that exists when examining the web traffic of a portal, although it is limited to that coming from desktops and laptops, without taking into account those that come from mobile devices, currently impossible to determine with existing measurement tools on the market.
It includes:
- Web traffic general data: average visit duration, pages per visit and bounce rate
- Web traffic origin by country
- Percentage of traffic generated from social media over total web traffic
- Distribution of web traffic generated from social networks
- Comparison of web traffic generated from social networks with direct and search procedures
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
The dataset contains traffic collected for 96 websites located in
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network Address Translation (NAT)