19 datasets found
  1. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

You can easily subset the data into the car types that you will be modeling by first loading the CSV into R; here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

The next piece of code subsets the data into specific car types. The example below keeps only Uber 'Black' car types.

df_black <- subset(df, df$name == 'Black')

Next, we write this data frame to a CSV file on our computer so that it can be shared and loaded back into R.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

The file will appear in your working directory. If you are not familiar with your working directory, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  2. Kickastarter Campaigns

    • kaggle.com
    zip
    Updated Jan 25, 2024
    Cite
    Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
    Explore at:
Available download formats: zip (2233314 bytes)
    Dataset updated
    Jan 25, 2024
    Authors
    Alessio Cantara
    Description

Welcome to my Kickstarter case study! In this project I’m trying to understand what the success factors for a Kickstarter campaign are, analysing a publicly available dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

    ASK

Several questions will guide my analysis: 1. Does the campaign duration influence the success of the project? 2. Does the chosen funding budget? 3. Which category of campaign is the most likely to be successful?

    PREPARE

I’m using the Kickstarter datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month, and all the data are divided into CSV files. Each table contains:

    • backers_count: number of people that contributed to the campaign
    • blurb: a captivating text description of the project
    • category: the label categorizing the campaign (technology, art, etc.)
    • country
    • created_at: day and time of campaign creation
    • deadline: day and time of the campaign's latest possible end
    • goal: amount to be collected
    • launched_at: date and time of campaign launch
    • name: name of campaign
    • pledged: amount of money collected
    • state: success or failure of the campaign

Each month's scraping produces a huge number of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month's files were kept in a separate folder.

Having a first look at the spreadsheets, it’s clear that some cleaning and modification is needed: for example, dates and times are stored as Unix timestamps, there are multiple columns that are not helpful for the scope of my analysis, and currencies need to be standardised (some are US$, some GB£, etc.). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

    PROCESS

I decided to use R to clean and process the data. For each month I set up a new working environment in its own folder. After loading the necessary libraries – library(tidyverse), library(lubridate), library(ggplot2), library(dplyr), library(tidyr) – I scripted a general R routine that searches for CSV files in the folder, opens each one as a separate variable, and collects them into a list of data frames:

    csv_files <- list.files(pattern = "\\.csv$")
    data_frames <- list()
    
    for (file in csv_files) {
     variable_name <- sub("\\.csv$", "", file)
     assign(variable_name, read.csv(file))
     data_frames[[variable_name]] <- get(variable_name)
    }
    

Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.

    data_frames <- lapply(data_frames, function(df) {
     df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_pledged <- as.numeric(df$usd_pledged)
     return(df)
    })
    

    In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

    all_nov_2023 = bind_rows(data_frames)
    all_dec_2023 = bind_rows(data_frames)
all_jan_2024 = bind_rows(data_frames)
    

After merging, I converted the Unix timestamps into readable datetimes for the columns “created”, “launched”, and “deadline”, and removed all the rows that had these fields set to 0. I also filtered the values in the “slug” column to keep only the category of the campaign, dropping information unnecessary for the scope of my analysis. The final table was then saved.

    filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
     select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>%
     filter(created_at != 0 & deadline != 0 & launched_at != 0) %>% 
     mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% 
     mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>% 
     mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>% 
     mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>% 
     select(-category, -deadline, -launched_at, -created_at) %>% 
     relocate(created, launched, setted_deadline, .before = goal)
    
    write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
    
    

    The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
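A minimal sketch of that final merge (the exact script is not shown in the description; the November and January file names are assumptions mirroring filtered_dec_2023.csv from the code above):

    library(dplyr)

    filtered_nov_2023 <- read.csv("filtered_nov_2023.csv")   # file names for Nov/Jan assumed
    filtered_dec_2023 <- read.csv("filtered_dec_2023.csv")
    filtered_jan_2024 <- read.csv("filtered_jan_2024.csv")

    kickstarter_cleaned <- bind_rows(filtered_nov_2023, filtered_dec_2023, filtered_jan_2024)
    write.csv(kickstarter_cleaned, "kickstarter_cleaned.csv", row.names = FALSE)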

  3. Myrstener et al. (2025) Downstream temperature effects of boreal forest...

    • researchdata.se
    • su.figshare.com
    Updated Feb 17, 2025
    + more versions
    Cite
    Caroline Greiser; Lenka Kuglerová; Maria Myrstener (2025). Myrstener et al. (2025) Downstream temperature effects of boreal forest clearcutting vary with riparian buffer width - Data and Code [Dataset]. http://doi.org/10.17045/STHLMUNI.27188004
    Explore at:
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Stockholm University
    Authors
    Caroline Greiser; Lenka Kuglerová; Maria Myrstener
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Please read the readme.txt!

This repository contains raw and clean data (.csv), as well as the R scripts (.r) that process the data, create the plots and fit the models.

We recommend going through the R scripts in chronological order.

    Code was developed in the R software:

R version 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64

    ****** List of files ********************************

    • Data

    ---raw

    72 files from 72 Hobo data loggers

    names: site_position_medium.csv

    example: "20_20_down_water.csv" (site = 20, position = 20 m downstream, medium = water)

    ---clean

    site_logger_position_medium.csv list of all sites, their loggers, their position and medium in which they were placed

    loggerdata_compiled.csv all raw logger data (see above) compiled into one dataframe, for column names see below

    Daily_loggerdata.csv all data aggregated to daily mean, max and min values, for column names see below

    CG_site_distance_pairs.csv all logger positions for each stream and their pairwise geographical distance in meters

    Discharge_site7.csv Discharge data for the same season as logger data from a reference stream

    buffer_width_eniro_CG.csv measured and averaged buffer widths for each of the studied streams (in m)

    • Scripts

    01_compile_clean_loggerdata.r

    02_aggregate_loggerdata.r

    03_model_stream_temp_summer.r

    03b_model_stream_temp_autumn.r

    04_calculate_warming_cooling_rates_summer.r

    04b_calculate_warming_cooling_rates_autumn.r

    05_model_air_temp_summer.r

    05b_model_air_temp_autumn.r

    06_plot_representative_time_series_temp_discharge.r

    ****** Column names ********************************

Most column names are self-explanatory, and they are also explained in the R code.

Below is some detailed information on two data frames (.csv); the column names are similar in the other CSV files. A short example of loading these files follows the column list.

    File "loggerdata_compiled.csv" [in Data/clean/ ]

    "Logger.SN" Logger serial number

    "Timestamp" Datetime, YYYY-MM-DD HH:MM:SS

    "Temp" temperature in °C

    "Illum" light in lux

    "Year" YYYY

    "Month" MM

    "Day" DD

    "Hour" HH

    "Minute" MM

    "Second" SS

    "tz" time zone

    "path" file path

    "site" stream/site ID

    "file" file name

    "medium" "water" or "air"

    "position" one of 6 positions along the stream: up, mid, end, 20, 70, 150

    "date" YYYY-MM-DD

    File "Daily_loggerdata.csv" [in Data/clean/ ]

    "date" ... (see above)

    "Logger.SN" Logger serial number

    "mean_temp" mean daily temperature

    "min_temp" minimum daily temperature

    "max_temp" maximum daily temperature

    "path" ...

    "site" ...

    "file" ...

    "medium" ...

    "position" ...

    "buffer" one of 3 buffer categories: no, thin, wide

    "Temp.max.ref" maximum daily temperature of the upstream reference logger

    "Temp.min.ref" minimum daily temperature of the upstream reference logger

    "Temp.mean.ref" mean daily temperature of the upstream reference logger

    "Temp.max.dev" max. temperature difference to upstream reference

    "Temp.min.dev" min. temperature difference to upstream reference

    "Temp.mean.dev" mean temperature difference to upstream reference

    Paper abstract:

    Clearcutting increases temperatures of forest streams, and in temperate zones, the effects can extend far downstream. Here, we studied whether similar patterns are found in colder, boreal zones and if riparian buffers can prevent stream water from heating up. We recorded temperature at 45 locations across nine streams with varying buffer widths. In these streams, we compared upstream (control) reaches with reaches in clearcuts and up to 150 m downstream. In summer, we found daily maximum water temperature increases on clearcuts up to 4.1 °C with the warmest week ranging from 12.0 to 18.6 °C. We further found that warming was sustained downstream of clearcuts to 150 m in three out of six streams with buffers < 10 m. Surprisingly, temperature patterns in autumn resembled those in summer, yet with lower absolute temperatures (maximum warming was 1.9 °C in autumn). Clearcuts in boreal forests can indeed warm streams, and because these temperature effects are propagated downstream, we risk catchment-scale effects and cumulative warming when streams pass through several clearcuts. In this study, riparian buffers wider than 15 m protected against water temperature increases; hence, we call for a general increase of riparian buffer width along small streams in boreal forests.

  4. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
Available download formats: zip (1299 bytes)
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
  The insights will help the marketing team design a strategy targeting casual riders.
    

    Prepare

    Guiding Question:

Where is your data located?
  The data is located in Cyclistic's own organizational data.

How is the data organized?
  The datasets are in CSV format, one file per month, covering financial year 2022.

Are there issues with bias or credibility in this data? Does your data ROCCC?
  Yes, the data is ROCCC because it was collected internally by the Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are hidden in the data; we have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting and ending station names have null values.
    

    Process

    Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and I wanted to gain experience with the language.

Have you ensured the data’s integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze?
  Through the following checks and transformations (sketched in R after these questions):

  • Make sure the column names are consistent throughout all datasets by using the bind_rows() function.
  • Make sure the column data types are consistent throughout all the datasets by using compare_df_cols() from the janitor package.
  • Combine all the datasets into a single data frame to keep the analysis consistent.
  • Remove the columns start_lat, start_lng, end_lat, end_lng from the data frame because they are not required for the analysis.
  • Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
  • Create a ride_length column from the started_at and ended_at columns to find the riders' average ride duration.
  • Remove the null rows from the dataset using na.omit().

Have you documented your cleaning process so you can review and share those results?
  Yes, the cleaning process is documented clearly.
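A minimal sketch of the cleaning steps above (not the author's exact script; the monthly file names, column names such as started_at and ended_at, and the timestamp format are assumptions):

    library(tidyverse)
    library(lubridate)
    library(janitor)

    files <- list.files(pattern = "\\.csv$")        # the monthly trip CSVs
    trips_list <- lapply(files, read.csv)
    compare_df_cols(trips_list)                     # janitor: check that column types match

    all_trips <- bind_rows(trips_list) %>%          # one data frame for the whole year
      select(-start_lat, -start_lng, -end_lat, -end_lng) %>%
      mutate(started_at = ymd_hms(started_at),
             ended_at   = ymd_hms(ended_at),
             date  = as.Date(started_at),
             day   = day(date),
             month = month(date),
             year  = year(date),
             ride_length = difftime(ended_at, started_at, units = "mins")) %>%
      na.omit()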
    

    Analyze Phase:

    Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame by using the read.csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual members' ride durations are longer than annual members'.
  Casual members use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
  Annual members mainly ride for commuting.
  Casual members prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for each type of member.
    

    Share

Guiding Questions:

    Were you able to answer the question of how ...
    
  5. Data for analysis in Barrie et al. (2025)

    • figshare.com
    csv
    Updated May 21, 2025
    Cite
    Eleanor Barrie; Luke L. Powell; Billi Krochuk; Patricia F Rodrigues; Jared D Wolfe; Crinan Jarrett; Diogo F Ferreira; Kristin E Brzeski; Jacob C Cooper; Susana Lin Mufumu; Silvestre Esteban Malanza; Agustin Ebana Nsue Akele; Cayetano Ebana Ebana Alene (2025). Data for analysis in Barrie et al. (2025) [Dataset]. http://doi.org/10.6084/m9.figshare.29114960.v1
    Explore at:
Available download formats: csv
    Dataset updated
    May 21, 2025
    Dataset provided by
    figshare
Figshare (http://figshare.com/)
    Authors
    Eleanor Barrie; Luke L. Powell; Billi Krochuk; Patricia F Rodrigues; Jared D Wolfe; Crinan Jarrett; Diogo F Ferreira; Kristin E Brzeski; Jacob C Cooper; Susana Lin Mufumu; Silvestre Esteban Malanza; Agustin Ebana Nsue Akele; Cayetano Ebana Ebana Alene
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Barrie
    Description

These files contain the data used for analysis in: Barrie EM, Krochuk BA, Jarrett C, Ferreira DF, Rodrigues P, Mufumu SL, Malanza SE, Akele AEN, Alene CEE, Brzeski KE, Cooper JC, Wolfe JD and Powell LL (2025) Specialized insectivores drive differences in avian community composition between primary and secondary forest in Central Africa. Front. Conserv. Sci. 6:1504350. doi: 10.3389/fcosc.2025.1504350

At a long-term bird banding station on mainland Equatorial Guinea, we captured over 3200 birds across 6 field seasons in selectively logged secondary forest and in largely undisturbed primary forest. Our objective was to understand how community composition changed with human disturbance—with particular interest in the guilds and species that indicate primary rainforest.

banding_data.csv consists of the raw banding/capture data from mist-netting and ringing in the field, including info on time and date of capture, net lane and net number, species, ring number, and recaptures.

buffers.csv lists (for each net lane) the amount of overlap with other nearby net lanes and the proportion used for the offset in statistical analysis. See Barrie et al. (2025) for methodology.

days.csv lists all combinations of net lanes and dates run and whether these were "Day 1" or "Day 2" (all net lanes were run for two consecutive days per year).

effort.csv contains data on effort in terms of mist net hours, with the opening and closing times and duration open for every net run.

forest_type.csv lists each net lane and whether it was in primary or secondary forest.

guilds.csv contains data on the dietary guild classifications of all focal species analysed in Barrie et al. (2025), which is needed to merge with banding_data.csv in R and create the data frame for analysis.
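A minimal sketch of that merge (not the authors' script; the shared species key column name is an assumption):

    library(dplyr)

    banding <- read.csv("banding_data.csv")
    guilds  <- read.csv("guilds.csv")

    # attach the dietary guild classification to each capture record
    analysis_df <- left_join(banding, guilds, by = "species")   # key column name assumed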

  6. Data for the Farewell and Herberg example of a two-phase experiment using a...

    • researchdata.edu.au
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Jul 1, 2021
    Cite
    Chris Brien (2021). Data for the Farewell and Herberg example of a two-phase experiment using a plaid design [Dataset]. http://doi.org/10.25909/13122095
    Explore at:
    Dataset updated
    Jul 1, 2021
    Dataset provided by
    The University of Adelaide
    Authors
    Chris Brien
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The experiment that Farewell and Herzberg (2003) describe is a pain-rating experiment that is a subset of the experiment reported by Solomon et al. (1997). It is a two-phase experiment. The first phase is a self-assessment phase in which patients self-assess for pain while moving a painful shoulder joint. The second phase is an evaluation phase in which occupational and physical therapy students (the raters) are evaluated on how they rate the pain of patients shown in a set of videos. The measured response is the difference between a student's rating and the patient's rating.


The R data file plaid.dat.rda contains the data.frame plaid.dat that has a revised version of the data for the Farewell and Herzberg example downloaded from https://doi.org/10.17863/CAM.54494. The comma-delimited text file plaid.dat.csv has the same information in this more commonly accepted format, but without the metadata associated with the data.frame.

    The data.frame contains the factors Raters, Viewings, Trainings, Expressiveness, Patients, Occasions, and Motions and a column for the response variable Y. The two factors Viewings and Occasions are additional to those in the downloaded file and the remaining factors have been converted from integers or characters to factors and renamed to the names given above. The column Y is unchanged from the column in the original file.

    To load the data in R use:

    load("plaid.dat.rda") or

    plaid.dat <- read.csv(file = "plaid.dat.csv").
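Because the CSV carries no metadata, the factors can be re-declared after reading; a minimal sketch (an assumption, not part of the archive), using the factor names listed above:

    plaid.dat <- read.csv(file = "plaid.dat.csv")
    factor_cols <- c("Raters", "Viewings", "Trainings", "Expressiveness",
                     "Patients", "Occasions", "Motions")
    plaid.dat[factor_cols] <- lapply(plaid.dat[factor_cols], factor)
    str(plaid.dat)   # Y remains the numeric response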

    References

    Farewell, V. T.,& Herzberg, A. M. (2003). Plaid designs for the evaluation of training for medical practitioners. Journal of Applied Statistics, 30(9), 957-965. https://doi.org/10.1080/0266476032000076092

    Solomon, P. E., Prkachin, K. M. & Farewell, V. (1997). Enhancing sensitivity to facial expression of pain. Pain, 71(3), 279-284. https://doi.org/10.1016/S0304-3959(97)03377-0

  7. The CORESIDENCE Database: National and Subnational Data on Household and...

    • data.europa.eu
    • zenodo.org
    unknown
    Cite
    Zenodo, The CORESIDENCE Database: National and Subnational Data on Household and Living Arrangements Around the World, 1964-2021 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8142652?locale=hu
    Explore at:
Available download formats: unknown (18275)
    Dataset authored and provided by
Zenodo (http://zenodo.org/)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Households are the fundamental units of co-residence and play a crucial role in social and economic reproduction worldwide. They are also widely used as units of enumeration for data collection purposes, with substantive implications for research on poverty, living conditions, family structure, and gender dynamics. However, reliable comparative data on households, living arrangements, and changes in them around the world is still under development. The CORESIDENCE database (CoDB) aims to bridge the existing data gap by offering valuable insights not only into the documented disparities between countries but also into the often-elusive regional differences within countries. By providing comprehensive data, it facilitates a deeper understanding of the complex dynamics of co-residence around the world. This database is a significant contribution to research, as it sheds light on both macro-level variations across nations and micro-level variations within specific regions, facilitating more nuanced analyses and evidence-based policymaking. The CoDB is composed of three datasets covering 155 countries (National Dataset), 3563 regions (Subnational Dataset), and 1511 harmonized regions (Subnational-Harmonized Dataset) for the period 1960 to 2021, and it provides 146 indicators on household composition and family arrangements across the world.

This repository is composed of the following elements: an RData file named CORESIDENDE_DATABASE containing the CoDB in the form of a list. The CORESIDENDE_DB list object is composed of six elements:

    • NATIONAL: a data frame with the household composition and living arrangements indicators at the national level.
    • SUBNATIONAL: a data frame with the household composition and living arrangements indicators at the subnational level, computed over the original subnational division provided in each sample and data source.
    • SUBNATIONAL_HARMONIZED: a data frame with the household composition and living arrangements indicators computed over the harmonized subnational regions.
    • SUBNATIONAL_BOUNDARIES_CORESIDENCE: a spatial data frame (an sf object) with the boundary delimitation of the subnational harmonized regions created for this project.
    • CODEBOOK: a data frame with the complete list of indicators, their code names and descriptions.
    • HARMONIZATION_TABLE: a data frame with the full list of individual country-year samples employed in this project and their state of inclusion in the 3 datasets composing the CoDB.

Elements 1, 2, 3, 5 and 6 of the R list are also provided as CSV files under the same names. Element 4, the harmonized boundaries, is available as a gpkg (GeoPackage) file.
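A minimal sketch (an assumption based on the object names given above, not part of the repository) of loading the RData file and pulling out individual elements:

    # File and object names as stated in the description; the .RData extension is assumed.
    load("CORESIDENDE_DATABASE.RData")

    national   <- CORESIDENDE_DB$NATIONAL       # national-level indicators
    codebook   <- CORESIDENDE_DB$CODEBOOK       # indicator codes and descriptions
    boundaries <- CORESIDENDE_DB$SUBNATIONAL_BOUNDARIES_CORESIDENCE  # sf object with harmonized regions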

  8. Data from: A Phanerozoic gridded dataset for palaeogeographic...

    • portalcientifico.uvigo.gal
    • data.niaid.nih.gov
    • +1more
    Updated 2024
    + more versions
    Cite
    Jones, Lewis A.; Domeier, Mathew; Jones, Lewis A.; Domeier, Mathew (2024). A Phanerozoic gridded dataset for palaeogeographic reconstructions [Dataset]. https://portalcientifico.uvigo.gal/documentos/668fc42bb9e7c03b01bd5735
    Explore at:
    Dataset updated
    2024
    Authors
    Jones, Lewis A.; Domeier, Mathew; Jones, Lewis A.; Domeier, Mathew
    Description

    This repository provides access to five pre-computed reconstruction files as well as the static polygons and rotation files used to generate them. This set of palaeogeographic reconstruction files provide palaeocoordinates for three global grids at H3 resolutions 2, 3, and 4, which have an average cell spacing of ~316 km, ~119 km, and ~45 km, respectively. Grids were reconstructed at a temporal resolution of one million years throughout the entire Phanerozoic (540–0 Ma). The reconstruction files are stored as comma-separated-value (CSV) files which can be easily read by almost any spreadsheet program (e.g. Microsoft Excel and Google Sheets) or programming language (e.g. Python, Julia, and R). In addition, R Data Serialization (RDS) files—a common format for saving R objects—are also provided as lighter (and compressed) alternatives to the CSV files. The structure of the reconstruction files follows a wide-form data frame structure to ease indexing. Each file consists of three initial index columns relating to the H3 cell index (i.e. the 'H3 address'), present-day longitude of the cell centroid, and the present-day latitude of the cell centroid. The subsequent columns provide the reconstructed longitudinal and latitudinal coordinate pairs for their respective age of reconstruction in ascending order, indicated by a numerical suffix. Each row contains a unique spatial point on the Earth's continental surface reconstructed through time. NA values within the reconstruction files indicate points which are not defined in deeper time (i.e. either the static polygon does not exist at that time, or it is outside the temporal coverage as defined by the rotation file).
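A minimal sketch (not one of the provided scripts) of reading one reconstruction file and extracting the coordinate pair for a single reconstruction age; the file name and the exact column names with the numerical age suffix are assumptions based on the structure described above:

    recon <- read.csv("WR13_res3.csv")          # hypothetical file name

    head(recon[, 1:3])                          # H3 address, present-day lon, present-day lat

    age <- 100                                  # Ma
    age_cols <- grep(paste0("_", age, "$"), names(recon), value = TRUE)
    coords_100ma <- cbind(recon[, 1:3], recon[, age_cols])
    head(coords_100ma)                          # NA where a point is undefined at 100 Ma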

    The following five Global Plate Models are provided (abbreviation, temporal coverage, reference) within the GPMs folder:

    WR13, 0–550 Ma, (Wright et al., 2013)

    MA16, 0–410 Ma, (Matthews et al., 2016)

    TC16, 0–540 Ma, (Torsvik and Cocks, 2016)

    SC16, 0–1100 Ma, (Scotese, 2016)

    ME21, 0–1000 Ma, (Merdith et al., 2021)

    In addition, the H3 grids for resolutions 2, 3, and 4 are provided within the grids folder. Finally, we also provide two scripts (python and R) within the code folder which can be used to generate reconstructed coordinates for user data from the reconstruction files.

    For access to the code used to generate these files:

    https://github.com/LewisAJones/PhanGrids

    For more information, please refer to the article describing the data:

    Jones, L.A. and Domeier, M.M. 2024. A Phanerozoic gridded dataset for palaeogeographic reconstructions. (2024).

    For any additional queries, contact:

    Lewis A. Jones (lewisa.jones@outlook.com) or Mathew M. Domeier (mathewd@uio.no)

    If you use these files, please cite:

    Jones, L.A. and Domeier, M.M. 2024. A Phanerozoic gridded dataset for palaeogeographic reconstructions. DOI: 10.5281/zenodo.10069221

    References

    Matthews, K. J., Maloney, K. T., Zahirovic, S., Williams, S. E., Seton, M., & Müller, R. D. (2016). Global plate boundary evolution and kinematics since the late Paleozoic. Global and Planetary Change, 146, 226–250. https://doi.org/10.1016/j.gloplacha.2016.10.002.

    Merdith, A. S., Williams, S. E., Collins, A. S., Tetley, M. G., Mulder, J. A., Blades, M. L., Young, A., Armistead, S. E., Cannon, J., Zahirovic, S., & Müller, R. D. (2021). Extending full-plate tectonic models into deep time: Linking the Neoproterozoic and the Phanerozoic. Earth-Science Reviews, 214, 103477. https://doi.org/10.1016/j.earscirev.2020.103477.

    Scotese, C. R. (2016). Tutorial: PALEOMAP paleoAtlas for GPlates and the paleoData plotter program: PALEOMAP Project, Technical Report.

    Torsvik, T. H., & Cocks, L. R. M. (2017). Earth history and palaeogeography. Cambridge University Press. https://doi.org/10.1017/9781316225523.

    Wright, N., Zahirovic, S., Müller, R. D., & Seton, M. (2013). Towards community-driven paleogeographic reconstructions: Integrating open-access paleogeographic and paleobiology data with plate tectonics. Biogeosciences, 10, 1529–1541. https://doi.org/10.5194/bg-10-1529-2013.

  9. Countries by population 2021 (Worldometer)

    • kaggle.com
    zip
    Updated Aug 16, 2021
    Cite
    Artem Zapara (2021). Countries by population 2021 (Worldometer) [Dataset]. https://www.kaggle.com/datasets/artemzapara/countries-by-population-2021-worldometer
    Explore at:
Available download formats: zip (8163 bytes)
    Dataset updated
    Aug 16, 2021
    Authors
    Artem Zapara
    Description

    Context

This dataset is a clean CSV file with the most recent estimates of the population of the countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/

    Content

The data was generated by web scraping the aforementioned link on 16 August 2021. Below is the code used to make the CSV data in Python 3.8:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://www.worldometers.info/world-population/population-by-country/"
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    countries = soup.find_all("table")[0]
    dataframe = pd.read_html(str(countries))[0]
    dataframe.to_csv("countries_by_population_2021.csv", index=False)

    Acknowledgements

The creation of this dataset would not be possible without the team at Worldometer, a data aggregation website.

  10. Data from: Code and Data Schimmelradar manuscript 1.1

    • data-staging.niaid.nih.gov
    Updated Apr 3, 2025
    Cite
    Kortenbosch, Hylke (2025). Code and Data Schimmelradar manuscript 1.1 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14851614
    Explore at:
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Wageningen University & Research
    Authors
    Kortenbosch, Hylke
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Read me – Schimmelradar manuscript

    The code in this repository was written to analyse the data and generate figures for the manuscript “Land use drives spatial structure of drug resistance in a fungal pathogen”.

This repository consists of two original .csv raw data files, two .tif files that are minimally reformatted after being downloaded from LGN.nl and www.pdok.nl/introductie/-/article/basisregistratie-gewaspercelen-brp-, and nine scripts using the R language. The remaining files include intermediate .tif and .csv files that allow the more computationally heavy steps of the analysis to be skipped and facilitate reproduction of the analysis.

Data files:

    Schimmelradar_360_submission.csv: The raw phenotypic resistance spatial data from the air sample

    • Sample: an arbitrary sample code given to each of the participants

    • Area: A random number assigned to each of the 100 areas the Netherlands was split up into to facilitate an even spread of samples across the country during participant selection.

    • Logistics status: Variable used to indicate whether the sample was returned in good order, not otherwise used in the analysis.

    • Arrived back on: The date by which the sample arrived back at Wageningen University

    • Quality seals: quality of the seals upon sample return, only samples of a quality designated as good seals were processed. (also see Supplement file – section A).

    • Start sampling: The date on which the trap was deployed and the stickers exposed to the air, recorded by the participant

    • End sampling: The date on which the trap was taken down and the stickers were re-covered and no longer exposed to the air, recorded by the participant

    • 3 back in area?: Binary indicating whether at least three samples have been returned in the respective area (see Area)

    • Batch: The date on which processing of the sample was started. To be more specific, the date at which Flamingo medium was poured over the seals of the sample and incubation was started.

    • Lab processing: Binary indication completion of lab processing

    • Tot ITR: A. fumigatus CFU count in the permissive layer of the itraconazole-treated plate

    • RES ITR: CFU count of colonies that had breached the surface of the itraconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

    • RF ITR: The itraconazole (~4 mg/L) resistance fraction = RES ITR/Tot ITR

    • Muccor ITR: Indication of the presence of Mucorales spp. growth on the itraconazole treatment plate

    • Tot VOR: A. fumigatus CFU count in the permissive layer of the voriconazole-treated plate

    • RES VOR: CFU count of colonies that had breached the surface of the voriconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

    • RF VOR: The voriconazole (~2 mg/L) resistance fraction = RES VOR/Tot VOR

    • Muccor VOR: Indication of the presence of Mucorales spp. growth on the voriconazole treatment plate

• Tot CON: CFU count on the untreated growth control plate

• Note: note on the sample based on either information given by the participant or observations in the lab. The exclude label was given if the sample had either too few (<25) or too many (>300) CFUs on one or more of the plates (also see Supplement file – section A).

    • Lat: Exact latitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

    • Long: Exact longitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

    • Round_Lat: Rounded latitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

    • Round_Long: Rounded longitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

    Analysis_genotypic_schimmelradar_TR_types.csv: The genotype data inferred from gel electrophoresis for all resistant isolates

    • TR type: Indicates the length of the tandem repeats in bp, as judged from a gel. 34 bp, 46 bp, or multiples of 46.

    • Plate: 96-well plate on which the isolate was cultured

    • 96-well: well in which the isolate was cultured

    • Azole: Azole on which the isolate was grown and resistant to. Itraconazole (ITRA) or Voriconazole (VORI).

    • Sample: The air sample the isolate was taken from, corresponds to “Sample” in “Schimmelradar_360_submission.csv”.

    • Strata: The number that equates to “Area” in “Schimmelradar_360_submission.csv”.

    • WT: A binary that indicates whether an isolate had a wildtype cyp51a promotor.

    • TR34: A binary that indicates whether an isolate had a TR34 cyp51a promotor.

    • TR46: A binary that indicates whether an isolate had a TR46 cyp51a promotor.

    • TR46_3: A binary that indicates whether an isolate had a TR46*3 cyp51a promotor.

    • TR46_4: A binary that indicates whether an isolate had a TR46*4 cyp51a promotor.

    Script 1 - generation_100_equisized_areas_NL

    NOTE: Running this code is not necessary for the other analyses, it was used primarily for sample selection. The area distribution was used during the analysis in script 2B, yet each sample is already linked to an area in “Schimmelradar_360_submission.csv". This script was written to generate a spatial polygons data frame of 100 equisized areas of the Netherlands. The registrations for the citizen science project Schimmelradar were binned into these areas to facilitate a relatively even distribution of samples throughout the country which can be seen in Figure S1. The spatial polygons data frame can be opened and displayed in open-source software such as QGIS. The package “spcosa” used to generate the areas has RJava as a dependency, so having Java installed is required to run this script. The R script uses a shapefile of the Netherlands from the tmap package to generate the areas within the Netherlands. Generating a similar distribution for other countries will require different shape files!

    Script 2 - Spatial_data_integration_fungalradar

    This script produces 4 data files that describe land use in the Netherlands: The three focal.RData files with land use and resistant/colony counts, as well as the “Predictor_raster_NL.tif” land use file.

    In this script, both the phenotypic and genotypic resistance spatial data from the air samples taken during the Fungal radar citizen science project are integrated with the land use and weather data used to model them. It is not recommended to run this code because the data extraction is fairly computationally demanding and it does not itself contain key statistical analyses. Rather it is used to generate the objects used for modelling and spatial predictions that are also included in this repository.

The phenotypic resistance is summarised in Table 1, which is generated in this script. Subsequently, the spatial data from the LGN22 and BRP datasets are integrated into the data. These datasets can be loaded from the "LGN2022.tif" and "Gewas22rast.tiff" raster files, respectively. Links to the webpages where these files can be downloaded can be found in the code.

Once the raster files are loaded, the code generates heatmaps and calculates the proportions of all the land use classes in both a 5 and 10 km radius around every sample and across the country to make spatial predictions. Only the 10 km radius data are used in the later analysis; the 5 km radius was generated during an earlier stage of the analyses to test whether that radius would be more appropriate, and was left in for completeness. For documentation of the LGN22 dataset, we refer to https://lgn.nl/documentatie and for BRP to https://nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/44e6d4d3-8fc5-47d6-8712-33dd6d244eef; both of these online resources are in Dutch but can be readily translated. A list of the variables that were included from these datasets during model selection can be found in Table S3. Alongside land-use data, the code extracts weather data from data files that can be downloaded from https://cds.climate.copernicus.eu/datasets/sis-agrometeorological-indicators?tab=download for the Netherlands during the sampling window; dates and dimensions are listed within the code. The Weather_schimmelradar folder contains four subfolders, one for each weather variable that was considered during modelling: temperature, wind speed, precipitation and humidity. Each of these subfolders contains 44 .nc files that each cover the daily mean of the respective weather variable across the Netherlands for each of the 44 days of the sampling window the citizen scientists were given.

All spatial objects (weather + land use) are eventually merged into one predictor raster, "Predictor_raster_NL.tif". The land use fractions and weather data are subsequently integrated with the air sample data and the resistance data into a single spatial data frame and saved into an R object, "Schimmelradar360spat_focal.RData". The script concludes by merging the cyp51a haplotype data with this object as well, to create two further objects: "Schimmelradar360spat_focal_TR_VORI.RData" with the haplotype data of the voriconazole-resistant isolates and "Schimmelradar360spat_focal_TR_ITRA.RData" with the haplotype data of the itraconazole-resistant isolates. These two datasets are modelled separately in scripts 5 and 9, and 6 and 8, respectively. This final section of the script also generates summary Table S2, which summarises the frequency of the cyp51a haplotypes per azole treatment.
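As a minimal sketch (not part of the numbered scripts), the saved objects named above can be loaded back into R before running the modelling scripts:

    load("Schimmelradar360spat_focal.RData")          # phenotypic resistance + predictors
    load("Schimmelradar360spat_focal_TR_VORI.RData")  # adds voriconazole-resistant haplotypes
    load("Schimmelradar360spat_focal_TR_ITRA.RData")  # adds itraconazole-resistant haplotypes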

    If the relevant objects are loaded

  11. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
Available download formats: zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemsets that a customer is most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow in his industry and to provide customers with itemset suggestions, so we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

Association Rules are most used when you are planning to build associations between different objects in a set. They work when you are planning to find frequent patterns in a transaction database. They can tell you what items customers frequently buy together, and they allow the retailer to identify relationships between the items.

    An Example of Association Rules

Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought Computer Mouse => bought Mat for Mouse": support = P(Mouse & Mat) = 8/100 = 0.08; confidence = support / P(Computer Mouse) = 0.08/0.10 = 0.8; lift = confidence / P(Mat for Mouse) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
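The same arithmetic written out in R, just to make the three quantities explicit:

    n <- 100; mouse <- 10; mat <- 9; both <- 8

    support    <- both / n                 # 0.08
    confidence <- support / (mouse / n)    # 0.80, for the rule Mouse => Mat
    lift       <- confidence / (mat / n)   # ~8.9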

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
• File format: .xlsx
• Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

[Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

First, we need to load the required libraries. Below, I briefly describe each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
• arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

[Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

[Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png]
[Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

Next we will clean our data frame and remove missing values.

[Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
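A minimal sketch of that conversion and the subsequent rule mining (not the author's exact code; the file name and the support/confidence thresholds are illustrative assumptions):

    library(readxl)
    library(arules)
    library(arulesViz)

    retaildata <- read_excel("Assignment-1_Data.xlsx")
    retaildata <- retaildata[!is.na(retaildata$Itemname) & !is.na(retaildata$BillNo), ]

    # group items by invoice and coerce to the 'transactions' class used by arules
    baskets <- split(as.character(retaildata$Itemname), retaildata$BillNo)
    trans <- as(baskets, "transactions")

    # mine association rules with minimum support and confidence thresholds
    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift"), 10))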

  12. Data from: Earth surface evolution: a Phanerozoic gridded dataset of Global...

    • datasetcatalog.nlm.nih.gov
    • zenodo.org
    Updated Nov 3, 2023
    Cite
    Jones, Lewis A.; Domeier, Mathew (2023). Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions [Dataset]. http://doi.org/10.5281/zenodo.10069222
    Explore at:
    Dataset updated
    Nov 3, 2023
    Authors
    Jones, Lewis A.; Domeier, Mathew
    Area covered
    Earth
    Description

This repository provides access to five reconstruction files as well as the code and the static polygons and rotation files used to generate them. This set of palaeogeographic reconstruction files provide palaeocoordinates for three global grids at H3 resolutions 2, 3, and 4, which have an average cell spacing of ~316 km, ~119 km, and ~45 km. Grids were reconstructed at a temporal resolution of one million years throughout the entire Phanerozoic (540–0 Ma). The reconstruction files are stored as comma-separated-value (CSV) files which can be easily read by almost any spreadsheet program (e.g. Microsoft Excel and Google Sheets) or programming language (e.g. Python, Julia, and R). In addition, R Data Serialization (RDS) files—a common format for saving R objects—are also provided as lighter (and compressed) alternatives to the CSV files. The structure of the reconstruction files follows a wide-form data frame structure to ease indexing. Each file consists of three initial index columns relating to the H3 cell index (i.e. the 'H3 address'), present-day longitude of the cell centroid, and the present-day latitude of the cell centroid. The subsequent columns provide the reconstructed longitudinal and latitudinal coordinate pairs for their respective age of reconstruction in ascending order, indicated by a numerical suffix. Each row contains a unique spatial point on the Earth's continental surface reconstructed through time. NA values within the reconstruction files indicate points which are not defined in deeper time (i.e. either the static polygon does not exist at that time, or it is outside the temporal coverage as defined by the rotation file).

The following five Global Plate Models are provided (abbreviation, temporal coverage, reference):

WR13, 0–550 Ma, (Wright et al., 2013)

MA16, 0–410 Ma, (Matthews et al., 2016)

TC16, 0–540 Ma, (Torsvik and Cocks, 2016)

SC16, 0–1100 Ma, (Scotese, 2016)

ME21, 0–1000 Ma, (Merdith et al., 2021)

In addition, the H3 grids for resolutions 2, 3, and 4 are provided.

For more information, please refer to the article describing the data: Jones, L.A. and Domeier, M.M. 2023. Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions. (TBC).

For any additional queries, contact: Mathew M. Domeier (mathewd@uio.no) or Lewis A. Jones (lewisa.jones@outlook.com)

If you use these files, please cite: Jones, L.A. and Domeier, M.M. 2023. Earth surface evolution: a Phanerozoic gridded dataset of Global Plate Model reconstructions. Zenodo data repository. DOI: 10.5281/zenodo.10069222

  13. Supplement 1. Details of the spatially-explicit capture–recapture model,...

    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Cite
    Christopher M. Wojan; Shannon M. Knapp; Karen E. Mabry (2023). Supplement 1. Details of the spatially-explicit capture–recapture model, including R code and all data files. [Dataset]. http://doi.org/10.6084/m9.figshare.3562398.v1
    Explore at:
Available download formats: html
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Christopher M. Wojan; Shannon M. Knapp; Karen E. Mabry
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

File List

SECR_Wojan_MultiSession_v7_1.R (MD5: de06a7bd07f3d7feb96d1e9c69cb05b8) – R code for SECR analysis

MultiSession20131112captures.txt (MD5: 7dab2facef2bb179cacc4dcbdd8abcc2) – mark-recapture data for population density estimates

MultiSession20131112sessions.txt (MD5: d3f25b235e111124755bfd4fa6b3b1c0) – data identifying trap sessions

MultiSession20131112traps.txt (MD5: f2df71eaf8eb8e29a3bdd3aba49c4560) – data identifying X,Y UTM coordinates of trap locations

NatalSettleLocations20131118.csv (MD5: d1dd10122476ff05f1092369ca619483) – X,Y UTM coordinates of pre- and post-dispersal locations of radio-tracked animals

NatalSettleLocations20131118long.csv (MD5: 94e43854158cfb8ba61d7ab3752f2d7d) – X,Y UTM coordinates of pre- and post-dispersal locations of radio-tracked animals in long format

NatalSettleLocationsSet2_20140103.csv (MD5: ) – X,Y UTM coordinates of pre- and post-dispersal locations of trapped animals

NatalSettleLocationsSet2_20140103long.csv (MD5: ) – X,Y UTM coordinates of pre- and post-dispersal locations of trapped animals in long format

SECR_Wojan_MultiSession_v7_1_20140707_out.txt (MD5: 98d356356211c7b780bff18a985dd604) – R code to create an output file holding the results of the SECR analysis

      Description
        SECR_Wojan_MultiSession_v7_1.R
    

This is the R script used to run the spatially-explicit capture-recapture (SECR) analysis. Except for changing directory references, this file (along with the accompanying 3 data files) could be used to recreate the analysis used in this manuscript.

    MultiSession20131112captures.txt

This file contains the mark-recapture information for the individual mice used in the population density estimates (focal dispersers are not included in this file). The variables are:

         trap – this is the unique number for each trap. These numbers correspond to the variable "trapID" in the file MultiSession20131112traps.txt, which gives the (X,Y) coordinates for each trap number. Coordinates are in UTM Zone 10. 
         id – this is a unique identifier to each individual mouse used in the population density estimates. These do not include or correspond to any of the dispersing mice used in the analysis.
         session – this is a uniquely coded time period corresponding to a multi-day trapping session. See the variable "sessionID" in the file MultiSession20131112sessions.txt for the information for each session including the number of trap nights and date.
    
        MultiSession20131112sessions.txt
         This file contains data identifying each session (the variable "session" in the file MultiSession20131112captures.txt).
        The variables are:
    
         sessionID – these are unique identifiers for each multi-day trapping session and correspond to the variable "session" in the file MultiSession20131112captures.txt
         nights – this is the number of trapping nights for each trapping session (either 2 or 3)
         year – this is the year during which that trapping session was conducted.
         month – this is the month during which the first night of the trapping session was conducted, 1=January, 2=February, 3=March, etc.
         firstday – this is the date of the month for the first night of the multiple nights of trapping for that trapping session
    
       For example, sessionID=1 was for 3 nights beginning on 23 June 2004.
        MultiSession20131112traps.txt
        This file contains the trap location data (this becomes the R data frame "trapcoords")
       The variables are:
    
         trapID – these are simply unique identifiers (1 to 43) for each trap and correspond to the variable "trap" in the file MultiSession20131112captures.txt
         UTMx – this is the x-axis UTM coordinate for each trap (UTM Zone 10)
         UTMy – this is the y-axis UTM coordinate for each trap (UTM Zone 10)
    
        NatalSettleLocations20131118.csv
        This file contains the UTM coordinates (Zone 10) for each disperser from the radio-tracked data set ("Set 1").
         The variables are:
    
         ID - this is a unique identifier for a dispersing individual
         natalX – this is the x-axis UTM coordinate of the disperser at its natal site
         natalY – this is the y-axis UTM coordinate of the disperser at its natal site
         natalsession - this identifies the trapping session for the natal period and corresponds to the variable "session" in the file MultiSession20131112captures.txt and the variable "sessionID" in the file MultiSession20131112sessions.txt
         settleX - this is the x-axis UTM coordinate of the disperser at its settlement site
         settleY - this is the y-axis UTM coordinate of the disperser at its settlement site
         settlesession - this identifies the trapping session for the settlement period and corresponds to the variable "session" in the file MultiSession20131112captures.txt and the variable "sessionID" in the file MultiSession20131112sessions.txt
         set - this identifies that these dispersers belong to "Set 1" (the radio-tracked set of dispersers)
    
        NatalSettleLocations20131118long.csv
        This file contains the UTM coordinates (Zone 10) for each disperser from the radio-tracked data set ("Set 1").
         The variables are:
    
         Period – this identifies whether the location is the disperser's "natal" or "settle" location
         ID – this is a unique identifier for a dispersing individual
         UTMx – this is the x-axis UTM coordinate for the location (natal/settle)
         UTMy – this is the y-axis UTM coordinate for the location (natal/settle)
         session – this identifies the trapping session and corresponds to the variable "session" in the file MultiSession20131112captures.txt and the variable "sessionID" in the file MultiSession20131112sessions.txt
         set – this identifies that these dispersers belong to "Set 1" (the  radio-tracked set of dispersers)
    
         NatalSettleLocationsSet2_20140103.csv
         This file contains the UTM coordinates (Zone 10) for each disperser from the live-trapped data set ("Set 2").
         The variables are:
    
         ID - this is a unique identifier for a dispersing individual
         natalX – this is the x-axis UTM coordinate of the disperser at its natal site
         natalY – this is the y-axis UTM coordinate of the disperser at its natal site
         natalsession - this identifies the trapping session for the natal period and corresponds to the variable "session" in the file MultiSession20131112captures.txt and the variable "sessionID" in the file MultiSession20131112sessions.txt
         settleX - this is the x-axis UTM coordinate of the disperser at its settlement site
         settleY - this is the y-axis UTM coordinate of the disperser at its settlement site
         ...
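
    For orientation, here is a minimal R sketch for reading these files. It is not the archived SECR_Wojan_MultiSession_v7_1.R script, and the read options (header handling, separators) are assumptions; the exact import steps are defined in that script.

    # Read the three multi-session files (assumed whitespace-delimited with headers)
    captures   <- read.table("MultiSession20131112captures.txt", header = TRUE)
    sessions   <- read.table("MultiSession20131112sessions.txt", header = TRUE)
    trapcoords <- read.table("MultiSession20131112traps.txt",    header = TRUE)

    # Every trap referenced in the captures should exist in the trap file
    stopifnot(all(captures$trap %in% trapcoords$trapID))

    # Reconstruct the start date of each trapping session from year, month and firstday
    sessions$start <- as.Date(sprintf("%d-%02d-%02d",
                                      sessions$year, sessions$month, sessions$firstday))
    head(sessions)   # e.g. sessionID 1 should begin on 2004-06-23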
    
  14. Data Carpentry Genomics Curriculum Example Data

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    application/gzip
    Updated May 31, 2023
    Cite
    Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams (2023). Data Carpentry Genomics Curriculum Example Data [Dataset]. http://doi.org/10.6084/m9.figshare.7726454.v3
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.

    This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.

    This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).

    backup.tar.gz: contains the original fastq files, the reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3 Gb in size.

    Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.

    shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).

    sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109Mb in size.

    solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.

    vcf_clean_script.R: converts the vcf output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.

    combined_tidy_vcf.csv: output of vcf_clean_script.R
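
    As a quick orientation, a minimal R sketch for unpacking and inspecting these files might look like the following. It assumes the archives have already been downloaded from this FigShare record into the working directory and is not part of the curriculum itself.

    # Unpack the lesson inputs (names taken from the file descriptions above)
    untar("shell_data.tar.gz", exdir = "shell_data")  # Shell Genomics input files
    untar("sub.tar.gz", exdir = "sub")                # subsampled fastq files (~109 Mb)

    # The tidy variant table produced by vcf_clean_script.R can be read directly
    variants <- read.csv("combined_tidy_vcf.csv")
    str(variants)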

  15. Case study: Cyclistic bike-share analysis

    • kaggle.com
    zip
    Updated Mar 25, 2022
    + more versions
    Cite
    Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
    Explore at:
    zip(131490806 bytes)Available download formats
    Dataset updated
    Mar 25, 2022
    Authors
    Jorge4141
    Description

    Introduction

    This is a case study called Capstone Project from the Google Data Analytics Certificate.

    In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

    Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

    Scenario

    The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members.

    Primary Stakeholders:

    1: Cyclistic Executive Team

    2: Lily Moreno, Director of Marketing and Manager

    ASK

    1. How do annual members and casual riders use Cyclistic bikes differently?
    2. Why would casual riders buy Cyclistic annual memberships?
    3. How can Cyclistic use digital media to influence casual riders to become members?

    Prepare

    The last four quarters of data, covering April 01, 2019 - March 31, 2020, were selected for analysis. These are the datasets used:

    Divvy_Trips_2019_Q2
    Divvy_Trips_2019_Q3
    Divvy_Trips_2019_Q4
    Divvy_Trips_2020_Q1
    

    The data is stored in CSV files, one per quarter, for a total of four .csv files.

    Data appears to be reliable with no bias. It also appears to be original, current and cited.

    I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

    The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

    Limitations

    Financial information is not available.

    Process

    Used R to analyze and clean data

    • After installing the R packages, data was collected, wrangled and combined into a single file.
    • Columns were renamed.
    • Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.
    • Combined all quarters into one big data frame.
    • Removed unnecessary columns

    Analyze

    • Inspected new data table to ensure column names were correctly assigned.
    • Formatted columns to ensure proper data types were assigned (numeric, character, etc).
    • Consolidated the member_casual column.
    • Added day, month and year columns to aggregate data.
    • Added ride-length column to the entire dataframe for consistency.
    • Removed rides with negative trip durations and rides on bikes taken out of circulation, for quality control.
    • Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".
    • Aggregated data and compared average rides between members and casual users (a condensed R sketch of these steps is shown below).
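
    The sketch below condenses the Process and Analyze steps in R. It assumes the four quarterly CSV files are in the working directory; the rename() mapping for the 2019 files is a placeholder, since the older quarters use different column names than 2020 Q1.

    library(dplyr)
    library(lubridate)

    q1_2020 <- read.csv("Divvy_Trips_2020_Q1.csv")
    q4_2019 <- read.csv("Divvy_Trips_2019_Q4.csv") %>%
      rename(ride_id = trip_id, started_at = start_time,
             ended_at = end_time, member_casual = usertype)   # placeholder column mapping

    all_trips <- bind_rows(q4_2019, q1_2020) %>%              # repeat for the other quarters
      mutate(
        started_at    = ymd_hms(started_at),
        ended_at      = ymd_hms(ended_at),
        ride_length   = as.numeric(difftime(ended_at, started_at, units = "mins")),
        day_of_week   = wday(started_at, label = TRUE),
        member_casual = recode(member_casual, "member" = "Subscriber", "casual" = "Customer")
      ) %>%
      filter(ride_length > 0)                                  # drop negative durations

    # Compare average ride length between subscribers (members) and customers (casual riders)
    all_trips %>%
      group_by(member_casual) %>%
      summarise(mean_ride_mins = mean(ride_length), rides = n())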

    Share

    After analysis, visuals were created as shown below with R.

    Act

    Conclusion:

    • Data appears to show that casual riders and members use bike share differently.
    • Casual riders' average ride length is more than twice that of members.
    • Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.
    • Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

    Recommendations

    • Offer casual riders a membership package with promotions and discounts.
  16. Bank Loan Approval - LR, DT, RF and AUC

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    vikram amin (2023). Bank Loan Approval - LR, DT, RF and AUC [Dataset]. https://www.kaggle.com/datasets/vikramamin/bank-loan-approval-lr-dt-rf-and-auc
    Explore at:
    zip(61437 bytes)Available download formats
    Dataset updated
    Nov 7, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description
    • Dataset: the dependent variable is 'Personal.Loan'; 0 indicates the loan was not approved and 1 indicates it was approved.
    • Objective: perform exploratory data analysis, then fit Logistic Regression, Decision Tree and Random Forest models and compare them by AUC to find the best one. A condensed R sketch of this workflow follows the steps below.

    Steps:
    • Set the working directory and read the data.
    • Check the data types of all the variables.

    Data cleaning
    • Change the data types of certain variables to factors.
    • Check for missing data and duplicate records, and remove insignificant variables.
    • Create a new data frame, 'bank1', after dropping the 'ID' column.

    Exploratory data analysis
    • Load the required libraries and explore the data with bar charts and box plots that can help the bank management in decision making. Key findings (plots omitted here):
    • Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been.
    • Income is higher when there are fewer family members.
    • Personal loans have been approved for customers with higher incomes.
    • Income is fairly similar for customers owning and not owning a credit card.
    • Customers belonging to the rich class (income group 150-200) have the highest mortgage.
    • Average credit card spend (CC Avg) is similar for customers who opted for online services and those who did not.
    • More educated customers have a higher average credit card spend.
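
    The following is a minimal sketch of that modelling workflow, not the author's exact script. It assumes the cleaned data frame 'bank1' described above, with 'Personal.Loan' stored as a factor and the remaining columns used as predictors.

    library(rpart)          # decision tree
    library(randomForest)   # random forest
    library(pROC)           # ROC curves and AUC

    set.seed(42)
    idx   <- sample(nrow(bank1), floor(0.7 * nrow(bank1)))   # simple 70/30 train-test split
    train <- bank1[idx, ]
    test  <- bank1[-idx, ]

    # Logistic regression
    logit_fit  <- glm(Personal.Loan ~ ., data = train, family = binomial)
    logit_prob <- predict(logit_fit, test, type = "response")

    # Decision tree
    tree_fit  <- rpart(Personal.Loan ~ ., data = train, method = "class")
    tree_prob <- predict(tree_fit, test, type = "prob")[, 2]

    # Random forest
    rf_fit  <- randomForest(Personal.Loan ~ ., data = train)
    rf_prob <- predict(rf_fit, test, type = "prob")[, 2]

    # Compare the three models by AUC on the hold-out set
    sapply(list(logistic = logit_prob, tree = tree_prob, forest = rf_prob),
           function(p) as.numeric(auc(roc(test$Personal.Loan, p))))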
  17. Plant functional trait maps for Australia

    • figshare.com
    txt
    Updated Mar 9, 2025
    Cite
    Samuel Andrew (2025). Plant functional trait maps for Australia [Dataset]. http://doi.org/10.6084/m9.figshare.27948915.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Mar 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Samuel Andrew
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Australia
    Description

    These data are associated with a manuscript by Samuel C. Andrew, Irene Martín-Forés, Greg R. Guerin, David Coleman, Daniel S. Falster, Elizabeth Wenk, Ian J. Wright, and Rachael V. Gallagher (Mapping plant functional traits using gap-filled datasets to inform management and modelling, in review). All included files can be used to run the Rmarkdown file "Biogeography_Rcode_241010.Rmd" for the main steps of data preparation and analysis. Data objects are uploaded in the zip folder "Data_objects_1_to_9.zip".

    Inputs

    • "Supplementary_methods.zip": data and scripts for using AusTraits data to calculate species mean values. The code for calculating mean species values with the method described in the paper is in the "1_build_Austraits_2.R" R code files. Please contact David Coleman (dave.r.coleman@gmail.com) for further information.
    • "1_gap_filled_traits_all_natives_species.csv": species-level trait data for native Australian species. Gap-filling estimates were made by grouping varieties and subspecies at the species taxonomic level. Columns with the "_var" suffix hold the variance of each value estimated by the trait gap-filling workflow. Species with a value in the "_var" columns had estimated values for that trait, and those trait values can be converted to NA to obtain an un-gap-filled dataset (see output 7).
    • "2_Grid_cell_cliamte_data.csv": a list of equal-area grid cells (10 x 10 km) for Australia. The latitude and longitude coordinates are given in the second and third columns; equal-area coordinates are given in the fourth and fifth columns ("x", "y"). The remaining columns hold climate data: "AnnMeanTemp" - mean annual temperature (MAT), "AnnPrecip" - mean annual precipitation (MAP), "maxTemp" - average maximum temperature of the warmest month, "minTemp" - average minimum temperature of the coldest month.
    • "3_species_cell_id_200306.rds": an R .rds file (loaded with the readRDS() function) that contains the stacked species distribution data, listing the species expected to occur in each grid cell. The plant herbarium occurrence data from the Atlas of Living Australia (ALA) were used for species distribution modelling, as described in Andrew et al. (2021, Journal of Vegetation Science, 32(2), e13018, https://doi.org/10.1111/jvs.13018). The ALA occurrence data were downloaded in December 2019 (see https://doi.org/10.6084/m9.figshare.24503893.v2 for data), but a fresh download of ALA occurrence data is recommended for future studies.
    • "4_Austraits_taxa_data_230627.csv": taxonomic data from AusTraits used to update species names in the distribution data ("3_species_cell_id_200306.rds") to binomial names.
    • "5_Unweighted_ausplots_Trait_data.csv": estimates of trait means and variance for AusPlots species inventories.
    • "6_SDM_press_base_raster_10km.tif": raster mask of Australia used to plot grid-cell values for maps. Projection "Lambert Azimuthal Equal Area", crs = "+proj=laea +lat_0=-25.2744 +lon_0=133.7751 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs".

    Outputs

    • "7_EA_SDM_10km_GC_Trait_layers.rds": this file can be loaded in R with the load() function and contains the outputs of the first chunk of code in the "Biogeography_Rcode_241010.Rmd" Rmarkdown file. It includes a data frame of species-level trait data ("Trait_data") and a data frame of grid-cell climate and community trait summaries ("Aus_grid_cell_data"). In the grid-cell data, each of the four traits has a mean ("mean_" prefix), standard deviation ("SD_" prefix), variability ("var_" prefix), maximum ("max_" prefix), minimum ("min_" prefix), number of species with trait data ("species_" prefix), and percentage of species with trait data ("_PerCent" suffix). For estimates based on gap-filled trait data, the column names carry the "_gap" suffix.
    • "8_EA_SDM_10km_GC_models_240422.rds": GAM model outputs, stored so that plotting figures and running the variance partitioning analyses is faster.
    • "9_Bootstrap_data_240422.rds": data and outputs from running the models with 100 random subsets of 10% of the grid cells.

    Results

    • "Biogeography_Rcode_241010.Rmd": Rmarkdown file with scripts that use inputs and outputs 1 to 9 and the csv files from "Results_tables.zip". The script should run as is with the data objects and csv files in the same directory as the Rmarkdown file.
    • "Results_tables.zip": includes the csv files for Tables 1, 2 and S1 ("Table_1_modes_240422.csv", "Table_2_partitioning_240422.csv", "Table_S1_plot_models_240422.csv"), used to report model results neatly in the knit file.
    • "Biogeography_Knit_241010.html": the output from the Rmarkdown file, showing the figures and tables.
    • "Trait_Maps.zip": zipped folder of the main trait maps as raster layers. Projection "Lambert Azimuthal Equal Area", crs = "+proj=laea +lat_0=-25.2744 +lon_0=133.7751 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs".
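
    A minimal R sketch for opening the main data objects, based on the loading functions named in the descriptions above (the full workflow is in Biogeography_Rcode_241010.Rmd):

    species_cells <- readRDS("3_species_cell_id_200306.rds")   # stacked species distribution data
    load("7_EA_SDM_10km_GC_Trait_layers.rds")                  # creates Trait_data and Aus_grid_cell_data

    traits  <- read.csv("1_gap_filled_traits_all_natives_species.csv")
    climate <- read.csv("2_Grid_cell_cliamte_data.csv")

    # The Australia mask raster could be read with, e.g., the terra package
    # (an assumption; the original workflow may use a different raster package):
    # library(terra); aus_mask <- rast("6_SDM_press_base_raster_10km.tif")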

  18. Code and data archive for Nettle et al. 'Consequences of measurement error...

    • zenodo.org
    bin, csv
    Updated Jan 24, 2020
    Cite
    Daniel Nettle; Daniel Nettle (2020). Code and data archive for Nettle et al. 'Consequences of measurement error in qPCR telomere data: A simulation study' [Dataset]. http://doi.org/10.5281/zenodo.2615735
    Explore at:
    bin, csvAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Nettle; Daniel Nettle
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Code and data for Nettle et al. 'Consequences of measurement error in qPCR telomere data: A simulation study'

    Updated version of March 2019

    Main simulation functions are contained in the script ‘simulation.functions.r’. When called, these functions (listed below) return datasets with requested properties containing both the ideal values of the quantities (Cqs, TS, etc.), and their post-error measured values. This allows the user to determine the differences between ideal and measured values, and perform other analyses. All simulation parameter values are user-specifiable. The script ‘paper.results.r’ reproduces all the figures and simulation results from the main paper. 'paper.results.r' also reads in the two .csv files of empirical data (dataset1 and dataset2).

    Datasets consist of observations from n individuals. The steps common to all of the simulation functions are as follows:

    • A vector of n true single copy gene abundances, true.dna.scg is defined, drawn from a normal distribution with mean b and standard deviation var.sample.size (b is a constant).
    • A vector of n relative telomere lengths, true.telo.var is defined, drawn from a normal distribution with mean 1 and standard deviation telomere.var.
    • Hence, the true abundance of the telomere sequence, true.dna.telo, is defined as a*true.dna.scg*true.telo.var. Here, a is a scaling constant representing how many copies of the telomeric sequence there are per single copy gene in the average sample.
    • Ideal Cq values for both reactions are defined as f – log2(true.dna.scg) and f – log2(true.dna.telo), where f is a constant representing the chosen fluorescence threshold.
    • Measurement errors in the Cqs are generated from a normal distribution with mean 0; standard deviations given by error.scg and error.telo; and a correlation between error.scg and error.telo given by error.cor.
    • Hence, measured Cqs are generated, which can be compared to the ideal Cq values.
    • TS ratios are calculated both on the measured Cqs, and the ideal ones.

    The following functions are available. Specify desired parameter values in parentheses, e.g. generate.one.dataset(n=10000, error.telo=0.1, error.scg=0.1, error.cor=0). Default values in the simulation functions are generally those given in table 1 of the main paper. A condensed sketch of the simulation logic appears after this list.

    • generate.one.dataset() returns a simple dataset (one telomere measurement per individual) for chosen values of all the variables described in section 1. As well as ideal and measured Cqs, it returns ideal and measured TS ratios. It also returns the difference between the ideal and measured TS ratio, calculated two ways, computed (error.computed), and using equation (11) of online supplement 1 (error.analytic). Both methods produce the same number. This was included as an additional check of correctness of the simulation.
    • generate.repeated.measure() returns a dataset where telomere lengths from the same individuals are measured twice, via two independent biological samples, and the true telomere length of each individual is assumed not to have changed at all. The data frame it returns is as for generate.one.dataset(), except that there are two of each variable (e.g. true.ts.1, true.ts.2, measured.ts.1, measured.ts.2, etc.).
    • calculate.repeatability() calculates the repeatability of the measured T/S ratio (intra-class correlation coefficient) when generate.repeated.measure() is implemented using the given values for all the parameters. It requires prior installation of R package ‘irr’.
    • compare.repeatability() returns the repeatability of the T/S ratio and the repeatability calculated on the raw Cq for the telomere reaction, for the given parameter values.
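
    To make the simulation steps concrete, here is a condensed R sketch reconstructing the common steps listed above. It is not the archived simulation.functions.r, and the default parameter values are placeholders rather than those in table 1 of the paper.

    library(MASS)   # mvrnorm, for correlated Cq measurement errors

    generate_one_dataset <- function(n = 10000, b = 50, var.sample.size = 5,
                                     telomere.var = 0.2, a = 1000, f = 30,
                                     error.scg = 0.1, error.telo = 0.1, error.cor = 0) {
      # True single copy gene abundance and relative telomere length
      true.dna.scg  <- rnorm(n, mean = b, sd = var.sample.size)
      true.telo.var <- rnorm(n, mean = 1, sd = telomere.var)
      true.dna.telo <- a * true.dna.scg * true.telo.var

      # Ideal Cq values at fluorescence threshold f
      ideal.cq.scg  <- f - log2(true.dna.scg)
      ideal.cq.telo <- f - log2(true.dna.telo)

      # Correlated measurement errors added to the Cqs
      Sigma <- matrix(c(error.scg^2, error.cor * error.scg * error.telo,
                        error.cor * error.scg * error.telo, error.telo^2), nrow = 2)
      err <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
      measured.cq.scg  <- ideal.cq.scg  + err[, 1]
      measured.cq.telo <- ideal.cq.telo + err[, 2]

      # T/S ratios calculated on the ideal and on the measured Cqs
      data.frame(true.ts     = 2^(ideal.cq.scg - ideal.cq.telo),
                 measured.ts = 2^(measured.cq.scg - measured.cq.telo))
    }

    sim <- generate_one_dataset(n = 1000)
    cor(sim$true.ts, sim$measured.ts)   # attenuation of the true signal by measurement error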

  19. 120 years of Olympic history: athletes and results

    • kaggle.com
    zip
    Updated Jun 15, 2018
    Cite
    rgriffin (2018). 120 years of Olympic history: athletes and results [Dataset]. https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
    Explore at:
    zip(5690772 bytes)Available download formats
    Dataset updated
    Jun 15, 2018
    Authors
    rgriffin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.

    Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.

    Content

    The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

    1. ID - Unique number for each athlete
    2. Name - Athlete's name
    3. Sex - M or F
    4. Age - Integer
    5. Height - In centimeters
    6. Weight - In kilograms
    7. Team - Team name
    8. NOC - National Olympic Committee 3-letter code
    9. Games - Year and season
    10. Year - Integer
    11. Season - Summer or Winter
    12. City - Host city
    13. Sport - Sport
    14. Event - Event
    15. Medal - Gold, Silver, Bronze, or NA
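
    A quick R sketch for loading the file and checking the columns listed above (an illustration only, not part of the dataset):

    athletes <- read.csv("athlete_events.csv", stringsAsFactors = FALSE)
    dim(athletes)                           # 271116 rows, 15 columns
    table(athletes$Season)                  # Summer vs Winter athlete-events
    table(athletes$Medal, useNA = "ifany")  # Gold, Silver, Bronze, NA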

    Acknowledgements

    The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.

    Inspiration

    This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.
