Welcome to my Kickstarter case study! In this project I'm trying to understand what the success factors of a Kickstarter campaign are, analyzing a publicly available dataset from Web Robots. The analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.
ASK
Three questions will guide my analysis:
1. Does the campaign duration influence the success of the project?
2. Does the chosen funding goal influence it?
3. Which category of campaign is the most likely to succeed?
PREPARE
I'm using the Kickstarter datasets publicly available on Web Robots. The data are scraped by a bot once a month and published as a set of CSV files. Each table contains:
- backers_count: number of people who contributed to the campaign
- blurb: a short, catchy text description of the project
- category: the label categorizing the campaign (technology, art, etc.)
- country
- created_at: date and time the campaign was created
- deadline: date and time the campaign ends
- goal: amount of money to be collected
- launched_at: date and time the campaign was launched
- name: name of the campaign
- pledged: amount of money collected
- state: success or failure of the campaign
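Before any cleaning, a single monthly CSV can be opened directly in R to verify these columns; the file name below is only a placeholder for one of the scraped files:
first_file <- read.csv("Kickstarter_2023-11.csv")  # hypothetical file name, standing in for one scraped CSV
names(first_file)         # lists all available columns, including the ones described above
str(first_file$deadline)  # the dates are stored as Unix timestamps (plain integers)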
Each monthly scrape produces a large number of CSVs, so for an initial analysis I decided to focus on three months: November 2023, December 2023 and January 2024. I downloaded the zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023) and 8 CSVs (January 2024). Each month's files were placed in a dedicated folder.
A first look at the spreadsheets makes it clear that some cleaning and modification is needed: dates and times are stored as Unix timestamps, several columns are not useful for the scope of my analysis, and the monetary amounts use different currencies (some US$, some GB£, etc.) that need to be brought to a common one. In general, though, I have all the data I need to answer my initial questions, identify trends and make predictions.
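As a quick sanity check before the full cleaning, a single record can be converted by hand. The snippet below is a minimal sketch with made-up example values: it shows how a Unix timestamp becomes a readable datetime and how an amount pledged in a local currency is expressed in US$, assuming the usd_exchange_rate column holds the local-to-US$ rate.
example_launched_at <- 1700000000           # made-up Unix timestamp (seconds since 1970-01-01)
as.POSIXct(example_launched_at, origin = "1970-01-01", tz = "UTC")
# "2023-11-14 22:13:20 UTC"
example_pledged <- 500                      # made-up amount pledged in the campaign's own currency
example_usd_exchange_rate <- 1.25           # made-up exchange rate to US$
example_pledged * example_usd_exchange_rate # 625, the pledged amount expressed in US$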
PROCESS
I decided to use R to clean and process the data. For each month I set up a new working directory in the corresponding folder and loaded the necessary libraries:
library(tidyverse)  # tidyverse already attaches dplyr, tidyr and ggplot2 (and, from v2.0.0, lubridate)
library(lubridate)  # the explicit calls below are therefore redundant but harmless
library(ggplot2)
library(dplyr)
library(tidyr)
I then wrote a general R script that searches for the CSV files in the folder, opens each one as a separate variable and collects them all in a list of data frames:
csv_files <- list.files(pattern = "\\.csv$")  # all CSV files in the working directory
data_frames <- list()
for (file in csv_files) {
  variable_name <- sub("\\.csv$", "", file)   # strip the .csv extension to get the variable name
  assign(variable_name, read.csv(file))       # read the file into its own variable
  data_frames[[variable_name]] <- get(variable_name)  # and store it in the list as well
}
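The assign()/get() pattern works, but it also creates one global variable per file besides the list. A shorter alternative that builds the same named list (a sketch, not what I used above) would be:
# Same named list of data frames, one element per CSV, without the extra global variables
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- setNames(lapply(csv_files, read.csv), sub("\\.csv$", "", csv_files))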
Next, I converted some columns to numeric values, because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.
data_frames <- lapply(data_frames, function(df) {
  # coerce these columns to numeric so their type is consistent across all files before merging
  df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
  df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
  df$usd_pledged <- as.numeric(df$usd_pledged)
  return(df)
})
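Before binding the files together, it can be worth confirming that the converted columns now have the same class in every file; a quick check along these lines should return "numeric" for each loaded CSV:
# Class of each converted column across all loaded CSVs (all should be "numeric" before merging)
sapply(data_frames, function(df) class(df$converted_pledged_amount))
sapply(data_frames, function(df) class(df$usd_exchange_rate))
sapply(data_frames, function(df) class(df$usd_pledged))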
In each folder I then ran a command to merge the CSVs into a single data frame (one for November 2023, one for December 2023 and one for January 2024):
all_nov_2023 <- bind_rows(data_frames)
all_dec_2023 <- bind_rows(data_frames)
all_jan_2024 <- bind_rows(data_frames)
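As a side note, bind_rows() can also record which CSV each row came from; a possible variation (not used in my final files) is:
# Optional variation: ".id" adds a column filled with the names of the list elements,
# so each row keeps a reference to the CSV it was read from
all_nov_2023 <- bind_rows(data_frames, .id = "source_file")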
After merging, I converted the Unix timestamps in the created_at, launched_at and deadline columns into readable datetime columns (created, launched and setted_deadline) and dropped all the rows where any of these fields was set to 0. I also extracted from the category column only the category slug of the campaign, discarding the rest of the information, which is not needed for the scope of my analysis. The resulting table was then saved:
filtered_dec_2023 <- all_dec_2023 %>%  # the names were changed according to the month being processed
  select(blurb, backers_count, category, country, created_at, launched_at, deadline,
         currency, usd_exchange_rate, goal, pledged, state) %>%
  filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%          # drop rows with missing dates
  mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% # keep only the category slug
  mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%     # Unix timestamps -> datetimes
  mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
  mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
  select(-category, -deadline, -launched_at, -created_at) %>%             # drop the raw columns
  relocate(created, launched, setted_deadline, .before = goal)
write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
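With the dates readable and the exchange rate kept, the variables behind my initial questions can later be derived directly from the cleaned table. The sketch below (assuming usd_exchange_rate converts the original currency into US$) adds the campaign duration in days and the goal expressed in US$:
# Sketch: derive the variables behind the ASK questions
# - duration_days: number of days between launch and deadline
# - goal_usd: funding goal converted to US$ (assumes usd_exchange_rate is the local -> US$ rate)
filtered_dec_2023 <- filtered_dec_2023 %>%
  mutate(duration_days = as.numeric(difftime(setted_deadline, launched, units = "days")),
         goal_usd = goal * usd_exchange_rate)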
The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...