Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the books is The economics of immigration : selected papers of Barry R. Chiswick. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the supplementary materials (Supplementary_figures.docx, Supplementary_tables.docx) of the manuscript: "Spatio-temporal dynamics of attacks around deaths of wolves: A statistical assessment of lethal control efficiency in France". This repository also provides the R codes and datasets necessary to run the analyses described in the manuscript.
The R datasets with suffix "_a" have anonymized spatial coordinates to preserve confidentiality; the preliminary preparation of the data is therefore not provided in the public code. These datasets, all geolocated and required for the analyses, are:
Attack_sf_a.RData: 19,302 analyzed wolf attacks on sheep
ID: unique ID of the attack
DATE: date of the attack
PASTURE: the related pasture ID from "Pasture_sf_a" where the attack is located
STATUS: column resulting from the preparation and the attribution of attacks to pastures (part 2.2.4 of the manuscript); not shown here to respect confidentiality
Pasture_sf_a.RData: 4987 analyzed pastures grazed by sheep
ID: unique ID of the pasture
CODE: Official code in the pastoral census
FLOCK_SIZE: maximum annual number of sheep grazing in the pasture
USED_MONTHS: months for which the pasture is grazed by sheep
Removal_sf_a.RData: 232 analyzed single wolf removal or groups of wolf removals
ID: unique ID of the removal
OVERLAP: whether the removal is a single removal ("non-interacting" in the manuscript, "NO" here) or part of a group ("interacting" in the manuscript; "SIMULTANEOUS" for removals occurring during the same operation, "NON-SIMULTANEOUS" otherwise)
DATE_MIN: date of the single removal or date of the first removal of a group
DATE_MAX: date of the single removal or date of the last removal of a group
CLASS: administrative type of the removal according to definitions from 2.1 part of the manuscript
SEX: sex or sexes of the removed wolves if known
AGE: age class of the removed wolves, if known
BREEDER: breeding status of the removed female wolves ("Yes" for breeding females, "No" for non-breeding females). Necropsied males are "No" by default; NA indicates removed individuals that were not found.
SEASON: season of the removal, as defined in part 2.3.4 of the manuscript
MASSIF: mountain range attributed to the removal, as defined in part 2.3.4 of the manuscript
Area_to_exclude_sf_a.RData: one row for each mountain range, corresponding to the area where removal controls of the mountain range could not be sampled, as defined in part 2.3.6 of the manuscript
These datasets were used to run the following analysis codes:
Code 1: The file Kernel_wolf_culling_attacks_p.R contains the before-after analyses.
We start by delimiting the spatio-temporal buffer for each row of the "Removal_sf_a.RData" dataset.
We identify the attacks from "Attack_sf_a.RData" within each buffer, giving the data frame "Buffer_df" (one row per attack)
We select the pastures from "Pasture_sf_a.RData" within each buffer, giving the data frame "Buffer_sf" (one row per removal)
We calculate the spatial correction
We spatially slice each buffer into 200 rings, giving the data frame "Ring_sf" (one row per ring)
We add the total pastoral area of the ring of the attack ("SPATIAL_WEIGHT"), for each attack of each buffer, within Buffer_df ("Buffer_df.RData")
We calculate the pastoral correction
We create the pastoral matrix for each removal, giving a matrix of 200 rows (one for each ring) and 180 columns (one for each day, from 90 days before to 90 days after the removal date), with the total pastoral area in use by sheep in each corresponding cell of the matrix (one element per removal, "Pastoral_matrix_lt.RData")
We simulate, for each removal, the random distribution of the attacks from "Buffer_df.RData" according to "Pastoral_matrix_lt.RData". The process is done 100 times (one element per simulation, "Buffer_simulation_lt.RData"); a minimal sketch of this step is given after this list.
We estimate the attack intensities
We classify the removals into 20 subsets, according to part 2.3.4 of the manuscript ("Variables_lt.RData") (one element per subset)
We perform, for each subset, the kernel estimations with the observed attacks ("Kernel_lt.RData"), with the simulated attacks ("Kernel_simulation_lt.RData") and we correct the first kernel computations with the second ("Kernel_controlled_lt.RData") (one element per subset).
We calculate, for each subset, the trend of attack intensities that compares the total attack intensity before and after the removals (part 2.3.5 of the manuscript), giving "Trends_intensities_df.RData" (one row per subset)
We calculate, for each subset, the trend of attack intensities along the spatial axis, three times, once for each time analysis scale. This gives "Shift_df" (one row per ring and per time analysis scale).
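As an illustration of the simulation step mentioned above, here is a minimal sketch in R; the objects below are hypothetical stand-ins for one element of "Pastoral_matrix_lt.RData" and the attacks of one buffer.

```r
# Minimal sketch: redistribute the attacks of one buffer at random over the
# 200 rings x 180 days grid, with cell probabilities proportional to the
# pastoral area in use. All values here are made up.
set.seed(1)

pastoral_matrix <- matrix(runif(200 * 180, min = 0, max = 5),
                          nrow = 200, ncol = 180)    # rings x days
n_attacks <- 50                                      # hypothetical attack count in this buffer

cell_id <- sample(length(pastoral_matrix), size = n_attacks,
                  replace = TRUE, prob = as.vector(pastoral_matrix))

# Map linear cell indices back to (ring, day) positions (R matrices are column-major)
simulated <- data.frame(
  RING = (cell_id - 1) %% nrow(pastoral_matrix) + 1,
  DAY  = (cell_id - 1) %/% nrow(pastoral_matrix) + 1   # position 1..180 in the 180-day window
)
head(simulated)
```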
Code 2: The file Control_removals_p.R contains the control-impact analyses.
It starts with the simulation of 100 removal control sets ("Control_sf_lt_a.RData") from the real set of removals ("Removal_sf_a.RData"), which is done with the function "Control_fn" (l. 92).
The rest of the analysis follows the same process as in the first code, "Kernel_wolf_culling_attacks_p.R", in order to apply the before-after analyses to each control set. All objects have the same structure as before, except that each is now a list with one element per control set. These objects have "control" in their names (not to be confused with "controlled", which refers to the pastoral correction already applied in the first code).
The code is then applied again, from l. 92 to l. 433, this time to the real set of removals (l. 121), with "Simulated = FALSE" (l. 119). We could not simply reuse the results from the first code because the set of removals is here restricted to removals attributed to mountain ranges. There are two resulting objects: "Kernel_real_lt.RData" (observed real trends) and "Kernel_controlled_real_lt.RData" (real trends corrected for pastoral use).
The part of the code from line 439 to 524 relates to the calculations of the trends (for the real set and the control sets), as in the first code, giving "Trends_intensities_real_df.RData" and "Trends_intensities_control_lt.RData".
The part of the code from line 530 to 588 relates to the calculation of the 95% confidence intervals and the means of the intensity trends for each subset, based on the results of the 100 control sets (Trends_intensities_mean_control_df.RData, Trends_intensities_CImin_control_df.RData and Trends_intensities_CImax_control_df.RData). These are used to test the significance of the real trends. This comparison is done right after, l. 595-627, and gives the data frame "Trends_comparison_df.RData".
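A minimal sketch in R of that comparison, using hypothetical stand-ins for the real trends and the 100 control-set trends.

```r
# Minimal sketch: for each subset, compare the real trend to the 2.5% and 97.5%
# quantiles of the trends obtained from the 100 control sets.
# 'real_trend' and 'control_trends' are made-up stand-ins for the objects above.
set.seed(2)
n_subsets      <- 20
real_trend     <- rnorm(n_subsets)                  # one trend value per subset
control_trends <- replicate(100, rnorm(n_subsets))  # rows: subsets, columns: control sets

trends_comparison <- data.frame(
  SUBSET       = seq_len(n_subsets),
  REAL         = real_trend,
  CONTROL_MEAN = rowMeans(control_trends),
  CI_MIN       = apply(control_trends, 1, quantile, probs = 0.025),
  CI_MAX       = apply(control_trends, 1, quantile, probs = 0.975)
)
trends_comparison$SIGNIFICANT <- with(trends_comparison, REAL < CI_MIN | REAL > CI_MAX)
head(trends_comparison)
```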
Code 3: The file Figures.R produces part of the figures from the manuscript:
"Dataset map": figure 1
"Buffer": figure 2 (then pasted in powerpoint)
"Kernel construction": figure 5 (then pasted in powerpoint)
"Trend distributions": figure 7
"Kernels": part of figures 10 and S2
"Attack shifts": figure 9 and S1
"Significant": figure 8
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is The wit & wisdom of Tommy Dewar : a selection from the speeches of Sir Thomas R. Dewar, whisky baron. It features 7 columns including author, publication date, language, and book publisher.
This file contains the data set used to develop a random forest model that predicts background specific conductivity for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample labelled R**, which indicates the NHD Hydrologic Unit, followed (after an underscore) by an up to 7-digit COMID and (after another underscore) the sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occurs in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, and are thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+, and data were limited to one observation per stream segment per month; SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered potentially minimally stressed where watersheds had 0-0.5% impervious surface, 0-5% urban, 0-10% agriculture, and population densities of 0.8-30 people/km2 (Table S3). Watersheds whose observations had large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds and from watersheds with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Streams Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
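As an illustration of the site-screening and validation split described above, here is a minimal sketch in R; the file name and the column names (pct_imperv, pct_urban, pct_agric, pop_density, nrsa_region) are assumptions, since the actual abbreviations are defined in the data dictionary.

```r
# Minimal sketch of the minimally-disturbed screening and the ~5% per-region
# validation hold-out described above. All file and column names are assumed;
# the real abbreviations are given in the data dictionary.
sites <- read.csv("background_sc_samples.csv")

minimally_disturbed <- subset(
  sites,
  pct_imperv  >= 0   & pct_imperv  <= 0.5 &   # 0 - 0.5% impervious surface
  pct_urban   >= 0   & pct_urban   <= 5   &   # 0 - 5% urban
  pct_agric   >= 0   & pct_agric   <= 10  &   # 0 - 10% agriculture
  pop_density >= 0.8 & pop_density <= 30      # 0.8 - 30 people/km2
)

# Randomly hold out about 5% of observations per NRSA region as validation data
set.seed(42)
val_idx <- unlist(lapply(
  split(seq_len(nrow(minimally_disturbed)), minimally_disturbed$nrsa_region),
  function(i) i[sample.int(length(i), ceiling(0.05 * length(i)))]
))
validation <- minimally_disturbed[val_idx, ]
training   <- minimally_disturbed[-val_idx, ]
```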
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow the business and to give customers suggestions on itemsets, so that we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.
Association Rules are most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between the items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
[Screenshot: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]
First, we need to load the required libraries; each library is briefly described below.
[Screenshot: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
[Screenshot: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png]
[Screenshot: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]
Next, we clean the data frame by removing missing values.
[Screenshot: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
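A minimal sketch in R of the full pipeline described above, using the arules package; the column names "BillNo" and "Itemname" are assumptions and may differ from those in Assignment-1_Data.xlsx.

```r
# Minimal sketch: read the workbook, drop missing values, convert invoices to
# transactions, and mine association rules. Column names are assumptions.
library(readxl)
library(arules)

retail <- read_excel("Assignment-1_Data.xlsx")
retail <- retail[complete.cases(retail), ]                  # remove missing values

# One transaction per invoice: the set of distinct items bought together
trans <- as(lapply(split(retail$Itemname, retail$BillNo), unique), "transactions")

# Mine rules with illustrative support/confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

inspect(head(sort(rules, by = "lift"), 10))                 # strongest rules by lift
```

The support and confidence thresholds above are illustrative; they would normally be tuned to the size of the transaction database.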
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent decades, the unfavorable solubility of novel therapeutic agents has been considered an important challenge in the pharmaceutical industry. Supercritical carbon dioxide (SCCO2) is known as a green, cost-effective, high-performance, and promising solvent for improving the low solubility of drugs with the aim of enhancing their therapeutic effects. The main objective of this study is to develop and refine several predictive models based on artificial intelligence (AI) to estimate the optimized value of Oxaprozin solubility in the SCCO2 system. In this paper, three different models were developed on a solubility dataset. Pressure (bar) and temperature (K) are the two inputs for each vector, and each vector has one output (solubility). The selected models are NU-SVM, Linear-SVM, and Decision Tree (DT). Models were optimized through hyper-parameter tuning and assessed using standard metrics. In terms of the R-squared metric, NU-SVM, Linear-SVM, and DT have scores of 0.994, 0.854, and 0.950, respectively, with RMSE values of 3.0982E-05, 1.5024E-04, and 1.1680E-04. Based on these evaluations, NU-SVM was considered the most precise method, and the optimal values it predicts can be summarized as (T = 336.05 K, P = 400.0 bar, solubility = 0.00127).
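The paper's implementation is not reproduced here; the following is a minimal sketch in R of this kind of model comparison, using e1071 (nu-SVR and linear SVR) and rpart (decision tree) as stand-ins and a hypothetical data frame with columns T_K, P_bar, and solubility.

```r
# Minimal sketch: fit nu-SVM, linear SVM, and decision-tree regressors on a
# (hypothetical) solubility data set and compare RMSE and R-squared in-sample.
library(e1071)   # nu-SVR and eps-SVR
library(rpart)   # decision tree

sol <- read.csv("oxaprozin_solubility.csv")   # hypothetical file: T_K, P_bar, solubility

nu_svm  <- svm(solubility ~ T_K + P_bar, data = sol, type = "nu-regression",  kernel = "radial")
lin_svm <- svm(solubility ~ T_K + P_bar, data = sol, type = "eps-regression", kernel = "linear")
dt      <- rpart(solubility ~ T_K + P_bar, data = sol)

metrics <- function(obs, pred) {
  c(RMSE = sqrt(mean((obs - pred)^2)),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}
sapply(list(NU_SVM = nu_svm, Linear_SVM = lin_svm, DT = dt),
       function(m) metrics(sol$solubility, predict(m, sol)))
```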
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.
Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
The file athlete_events.csv contains 271,116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are: ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, and Medal.
The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.
This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.
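For example, here is a minimal sketch in R of one such question, the share of female athlete-event entries per Games, using the Sex, Year, and Season columns listed above.

```r
# Minimal sketch: share of female athlete-event entries per Games year and season.
athletes <- read.csv("athlete_events.csv", stringsAsFactors = FALSE)

female_share <- aggregate(Sex ~ Year + Season, data = athletes,
                          FUN = function(s) mean(s == "F"))
names(female_share)[3] <- "female_share"
female_share[order(female_share$Year), ]
```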
The lack of publicly available National Football League (NFL) data sources has been a major obstacle to the creation of modern, reproducible research in football analytics. While clean play-by-play data is available via open-source software packages in other sports (e.g., nhlscrapr for hockey, PitchF/x data in baseball, and Basketball Reference for basketball), equivalent datasets are not freely available for researchers interested in the statistical analysis of the NFL. To solve this issue, a group of Carnegie Mellon University statistical researchers including Maksim Horowitz, Ron Yurko, and Sam Ventura built and released nflscrapR, an R package which uses an API maintained by the NFL to scrape, clean, parse, and output clean datasets at the individual play, player, game, and season levels. Using the data output by the package, the trio went on to develop reproducible methods for building expected points and win probability models for the NFL. The outputs of these models are included in this dataset and can be accessed using the nflscrapR package.
The dataset made available on Kaggle contains all the regular season plays from the 2009-2016 NFL seasons. The dataset has 356,768 rows and 100 columns. Each play is broken down into great detail containing information on: game situation, players involved, results, and advanced metrics such as expected point and win probability values. Detailed information about the dataset can be found at the following web page, along with more NFL data: https://github.com/ryurko/nflscrapR-data.
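A minimal sketch in R of loading and inspecting the play-by-play file; the file name below is an assumption and should be adjusted to the file shipped with the Kaggle dataset.

```r
# Minimal sketch: load the play-by-play data and check its dimensions.
# The file name is an assumption; adjust it to the Kaggle download.
pbp <- read.csv("NFL_play_by_play_2009_2016.csv", stringsAsFactors = FALSE)

dim(pbp)          # expected: 356768 rows and 100 columns
head(names(pbp))  # column names are documented in the nflscrapR-data repository
```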
This dataset was compiled by Ron Yurko, Sam Ventura, and myself. Special shout-out to Ron for improving our current expected points and win probability models and compiling this dataset. All three of us are proud founders of the Carnegie Mellon Sports Analytics Club.
This dataset is meant both to grow and to bring together the sports analytics community by providing clean and easily accessible NFL data that has never before been available on this scale for free.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About this file The Kaggle Global Superstore dataset is a comprehensive dataset containing information about sales and orders in a global superstore. It is a valuable resource for data analysis and visualization tasks. This dataset has been processed and transformed from its original format (txt) to CSV using the R programming language. The original dataset is available here, and the transformed CSV file used in this analysis can be found here.
Here is a description of the columns in the dataset:
category: The category of products sold in the superstore.
city: The city where the order was placed.
country: The country in which the superstore is located.
customer_id: A unique identifier for each customer.
customer_name: The name of the customer who placed the order.
discount: The discount applied to the order.
market: The market or region where the superstore operates.
ji_lu_shu: An unknown or unspecified column.
order_date: The date when the order was placed.
order_id: A unique identifier for each order.
order_priority: The priority level of the order.
product_id: A unique identifier for each product.
product_name: The name of the product.
profit: The profit generated from the order.
quantity: The quantity of products ordered.
region: The region where the order was placed.
row_id: A unique identifier for each row in the dataset.
sales: The total sales amount for the order.
segment: The customer segment (e.g., consumer, corporate, or home office).
ship_date: The date when the order was shipped.
ship_mode: The shipping mode used for the order.
shipping_cost: The cost of shipping for the order.
state: The state or region within the country.
sub_category: The sub-category of products within the main category.
year: The year in which the order was placed.
market2: Another column related to market information.
weeknum: The week number when the order was placed.
This dataset can be used for various data analysis tasks, including understanding sales patterns, customer behavior, and profitability in the context of a global superstore.
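As a quick illustration, here is a minimal sketch in R that loads the transformed CSV (the file name is assumed) and summarizes the documented sales and profit columns by category.

```r
# Minimal sketch: total sales and profit per product category.
# The CSV file name is an assumption; use the transformed file linked above.
superstore <- read.csv("global_superstore.csv", stringsAsFactors = FALSE)

by_category <- aggregate(cbind(sales, profit) ~ category, data = superstore, FUN = sum)
by_category[order(-by_category$profit), ]
```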