Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability. Once published, it will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
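As an illustration of how the long and wide tables relate, the sketch below derives a one-row-per-film table from the long file by keeping the first sampled festival appearance per film. The column name film_id and the .csv extension are placeholders assumed for this sketch; the actual variable names are documented in the codebook.
# Minimal sketch, assuming hypothetical column names (see “1_codebook_film-dataset_festival-program” for the real ones)
library(dplyr)
film_long <- read.csv("1_film-dataset_festival-program_long.csv")
film_first_fest <- film_long %>%
  group_by(film_id) %>%   # one group per unique film ID
  slice(1) %>%            # keep the first sampled festival appearance per film
  ungroup()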
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member of a given film. The file also contains a binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts used for web scraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
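For illustration, a minimal fuzzy-matching sketch using the stringdist R package; the package choice, example titles and q-gram size are assumptions made for this sketch and are not taken from the scripts themselves:
# Minimal sketch of fuzzy title matching with cosine similarity and OSA distance
library(stringdist)
core_title      <- "The Shape of Water"   # illustrative title standing in for a core-dataset film
imdb_candidates <- c("The Shape of Water", "The Shape of Wate", "Shapes of Water", "Water Shapes")
cosine_sim <- stringsim(core_title, imdb_candidates, method = "cosine", q = 2)  # q-gram cosine similarity (1 = identical)
osa_dist   <- stringdist(core_title, imdb_candidates, method = "osa")           # optimal string alignment distance (0 = identical)
data.frame(candidate = imdb_candidates, cosine_sim, osa_dist)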
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates the functions for scraping the data from the identified matches (identified by the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name, and festival categories.
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a fee or for free. Many bike-sharing systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information and the system unlocks a bike, which can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered a considerable dip in its revenues due to the Corona pandemic and is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue.
To that end, BoomBikes wants to understand the demand for shared bikes among the public, so that it can cater to people's needs once the situation improves, stand out from other service providers, and increase its profits.
It has contracted a consulting company to understand the factors on which the demand for these shared bikes depends; specifically, which factors affect the demand for shared bikes in the American market.
Based on various meteorological surveys and people's lifestyles, the service provider has gathered a large dataset of daily bike demand across the American market, together with a set of potential explanatory factors.
You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand how exactly demand varies with different features, and can adjust the business strategy accordingly to meet demand levels and customer expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.
In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered users. The model should be built with 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # R-squared of the predictions on the held-out test set
- where y_test contains the target values of the test set and y_pred contains the model's predicted values for the test set.
- Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘School Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/smeilisa07/number of school teacher student class on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This is my first data analysis project. I obtained this dataset from the Open Data Jakarta website (http://data.jakarta.go.id/), so most of it is in Indonesian, but I have tried to describe it in English in the VARIABLE DESCRIPTION.txt file.
The title of this dataset is jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016, provided as a CSV file, so it is easy to access. The title means "the number of schools, teachers, students, and classrooms by type of school, 2011-2016", which already gives a good idea of the contents. The dataset has 50 observations and 8 variables, covering 2011 to 2016.
In general, this dataset is about the quality of education in Jakarta: from year to year, some school levels decrease and others increase, but not significantly.
The dataset comes from the Indonesian education authorities and was published as a CSV file by Open Data Jakarta.
Although this data is publicly available from Open Data Jakarta, I want to keep improving my data science skills, especially in R, because I find R easy to learn and it keeps me curious about data science. I am still struggling with the problems below and would appreciate solutions.
Questions:
How can I clean this dataset? I have tried cleaning it, but I am still not sure; see the my_hypothesis.txt file, where I attempt to clean and visualise the data.
How can I specify a model for machine learning? What steps would you recommend?
How should I cluster the dataset if I want the labels to be tingkat_sekolah (school level) rather than numbers, for every tahun (year) and jenis_sekolah (school type)? See the my_hypothesis.txt file.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
C. Brock Kirwan 1001 KMBL, Brigham Young University, Provo, UT 84602 Email: kirwan@byu.edu Phone: 801-422-2532 Fax: 801-422-0602 ORCID ID: 0000-0003-0768-1446
Limited Evidence for a Triple Dissociation in the Medial Temporal Lobe: an fMRI Recognition Memory Replication Study
2020-2021
The present experiment aims to replicate two previous papers (cited below) in which authors present two analysis paths for a dataset in which participants underwent fMRI while performing a recognition memory test for old and new words. Both studies found activation in the hippocampus, with the first (Daselaar, Fleck, & Cabeza, 2006) demonstrating a distinction in hippocampus activation corresponding to true and perceived oldness of stimuli and the second (Daselaar, Fleck, Prince, & Cabeza, 2006) demonstrating that hippocampus activation reflects the subjective experience of the participant.
We replicated behavioral and MRI acquisition parameters reported in these two target articles with N=53 participants and focused fMRI analyses on regions of interest reported in those articles looking at fMRI activation for differences corresponding with true and perceived oldness and those associated with subjective memory experiences of recollection, familiarity, and novelty.
References: (1) Daselaar, S. M., Fleck, M. S., & Cabeza, R. (2006). Triple dissociation in the medial temporal lobes: Recollection, familiarity, and novelty. J Neurophysiol, 96(4), 1902–1911. https://doi.org/10.1152/jn.01029.2005 (2) Daselaar, S. M., Fleck, M. S., Prince, S. E., & Cabeza, R. (2006). The medial temporal lobe distinguishes old from new independently of consciousness. J Neurosci, 26(21), 5835–5839. https://doi.org/10.1523/JNEUROSCI.0258-06.2006
This dataset includes raw data from all scanned participants acquired with a Siemens Trio 3T MRI scanner (12-channel head coil), with each participant's data organized into the following folders: /anat, /fmap, and /func. /anat includes structural imaging data obtained from scanning, in the form of .nii.gz and .json files. /fmap includes field mapping data in the form of .nii.gz and .json files. /func includes functional imaging data obtained from scanning, in the form of .nii.gz and .json files, along with event.tsv files for each run (total runs = 4). Data for a total of N=53 participants are included in the present dataset.
True vs Perceived Oldness: Mean activity (mean parameter estimates) for each individual trial was extracted from the anterior/posterior MTL regions identified by the true-oldness and perceived-novelty contrasts. The resulting values were entered into a logistic regression model with the activations in the MTL regions as independent variables. Subjective Confidence: Mean activity for each individual trial was extracted from the different MTL regions (i.e., recollection-related, familiarity-related, and novelty-related activity) and entered into a multiple regression model with those activations as independent variables.
True vs Perceived Oldness: A binary variable reflecting whether participants correctly recognized an old item as old (hit) or incorrectly classified an old item as new (miss) was set as the dependent variable. Subjective Confidence: The 6-point oldness scale was entered as the dependent variable.
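For illustration, the two trial-level models described above can be sketched in R as a logistic and a multiple regression. The data frame and all column names below are hypothetical placeholders (simulated so the sketch runs), not the variable names used in the study.
# Minimal sketch of the trial-level models; data and names are simulated placeholders
set.seed(1)
trials <- data.frame(
  hit              = rbinom(100, 1, 0.6),               # 1 = hit, 0 = miss (old items only)
  oldness_rating   = sample(1:6, 100, replace = TRUE),  # 6-point oldness scale
  anterior_mtl     = rnorm(100),                        # mean parameter estimate per trial
  posterior_mtl    = rnorm(100),
  recollection_act = rnorm(100),
  familiarity_act  = rnorm(100),
  novelty_act      = rnorm(100)
)
# True vs perceived oldness: logistic regression of hit/miss on MTL activity
fit_oldness <- glm(hit ~ anterior_mtl + posterior_mtl, family = binomial, data = trials)
# Subjective confidence: multiple regression of the oldness rating on recollection-,
# familiarity- and novelty-related activity
fit_confidence <- lm(oldness_rating ~ recollection_act + familiarity_act + novelty_act, data = trials)
summary(fit_oldness)
summary(fit_confidence)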
N/A
Data were preprocessed, which included spatial motion correction and spatial normalization that was automatically generated by the fMRIPrep software. Following fMRIPrep preprocessing, functional data were scaled with a mean of 100 and blurred with an 8 mm FWHM Gaussian kernel to account for inter-subject anatomical variation. Analysis scripts are available here: https://osf.io/ctvsw/. Data was acquired for N=60 participants, with data from n=7 participants excluded for reasons of ineligibility (left-handedness, n=1), failure to comply with study procedures (n=2), excessive motion (n=3), and equipment error (n=1).
In our experimental task, participants completed a study phase in which they were presented with a randomized list of 120 real English words and 80 pseudo words at a rate of 2000 ms per item, with a fixation cross presented between words for a random interval of 0-5500 ms. Participants indicated whether each presented stimulus was a word or a pseudo word. They were not informed at this time that their memory for the words would be tested. After the completion of the study phase, researchers situated participants in the MRI scanner and obtained localizer, field map, and T1-weighted structural MRI scans before initiating the test phase of the experiment.
During the test phase, the task paradigm was presented as four experimental runs lasting between 435 and 442 seconds each. Participants saw an equal number of target stimuli (words shown during the study phase) and foil stimuli (novel words), 60 words per run. Target and foil stimuli were presented in a randomized order for 3.4 seconds each, during which participants judged whether the word had been presented on the study list. Confidence ratings for these judgments were then collected on a scale from 1 (lowest confidence) to 4 (highest confidence), with the prompt displayed for 1.7 seconds.
Recruitment: To determine sample size, an a priori power analysis was done by extracting values from Figure 1 of Daselaar, Fleck, Prince, et al. (2006) for the right hippocampus via Web Plot Digitizer, given that this region showed the smaller differences. We computed main effects by averaging hits and misses, and CRs and FAs, prior to SEM-to-SD conversion and averaging again. The resulting values were entered into G*Power to estimate an effect size of 0.46, indicating that a sample of N=54 would achieve a power of 0.95 with an error probability of 0.05 (t(1,53)=1.67). Participants were recruited from the campus community and met MRI compliance screening requirements. Exclusion: non-native English speakers, history of drug use, previous psychiatric or neurologic diagnosis, or contra-indications for MRI (e.g., ferromagnetic implant). Compensation: Participants were compensated for participation with a choice of $20, course credit, or a 3D-printed 1/4-scale model of their brain.
Localizer, field map, and T1-weighted structural MRI scans were obtained once the participants were situated in the scanner. MRI data were collected using a Siemens Trio 3T MRI scanner (12-channel head coil) and behavioral responses were collected using a four-key fiber-optic response cylinder (Current Designs, n.d.). Structural scanning was done at the beginning of the scan session (256 x 215 matrix, TR 1900 ms, TE 2.26 ms, FOV 250 x 218 mm, 176 slices, 1 mm slice thickness, 0 mm spacing) and functional scanning was done during all experimental runs (64 x 64 image matrix, TR 1800 ms, TE 31 ms, FOV 240 mm, 34 slices, 3.8 mm slice thickness). An MR-compatible LCD monitor displayed stimuli from the head of the bore, which participants viewed through a mirror mounted on the head coil. MRI data are available at: https://openneuro.org/datasets/ds004086.
[See above under STUDY PHASE and TEST PHASE for procedures performed once the participant arrived.]
Behavioral and imaging data were collected for each participant through the course of four (4) experimental runs. Behavioral data was used to create event.tsv files for each participant per run, indicating the onset, duration, trial type, stimuli response, correct answer, and reaction times of responses. Each experimental run lasted between 435-442 seconds, where participants saw an equal number of target stimuli (words shown in the study phase) and foil stimuli (novel words) at 60 words per run.
Stimuli were presented for 3.4 seconds, during which participants were asked to judge whether the word had been presented on the study list. Confidence ratings for those judgments were then collected on a scale from 1 (lowest confidence) to 4 (highest confidence). The prompt for the confidence ratings was displayed for 1.7 seconds, with each trial separated by an inter-trial interval (ITI) consisting of a fixation cross with a randomly distributed duration of 0-5.4 seconds (mean ITI=2.7 seconds).
Behavioral data were classified as hits, misses, correct rejections (CRs), and false alarms (FAs). Hits indicated correct judgments of “old” for words that were actually old. Misses reflected incorrect judgments of “new” for words that were actually old. Correct rejections indicated correct judgments of “new” for new words, and false alarms represented incorrect judgments of “old” for words that were actually new.
The study was performed in the MRI Research Facility at the Brigham Young University campus in Provo, UT.
The following subjects may be missing data and/or are not included in analyses for the following reasons:
Sub-001: Ineligible; left-handedness
Sub-005: Failure to comply; completed only 10% of entries compared to other subjects
Sub-026: Excessive motion
Sub-034: Failure to comply; did not provide a response other than a “1” or none
Sub-050: Excessive motion
Sub-052: Excessive motion
Sub-056: Equipment error
Sub-054 restarted their testing and completed the study protocol in full in the latter session.
This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.
Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.
Each dataframe contains 55 columns:
Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).
Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).
Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
Columns 4 to 55 contain the process variables; the column names retain the original variable names.
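For reference, a minimal sketch of loading and inspecting one of these files in R; the file name below is a placeholder, so substitute the name of the .RData file you actually downloaded:
# Minimal sketch: load one RData file and inspect the dataframe it creates
load("TEP_FaultFree_Training.RData")        # placeholder file name; creates 'fault_free_training' in the workspace
str(fault_free_training[, 1:6])             # faultNumber, simulationRun, sample and the first process variables
run1 <- subset(fault_free_training, simulationRun == 1)   # one complete simulation run
nrow(run1)                                  # 500 samples at 3-minute intervals (25 hours)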
This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.
The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.
In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.
Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.
When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.
This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels. SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide.
The SAPFLUXNET version 0.1.5 database harbours 202 globally distributed datasets, from 121 geographical locations. SAPFLUXNET contains sap flow data for 2714 individual plants (1584 angiosperms and 1130 gymnosperms), belonging to 174 species (141 angiosperms and 33 gymnosperms), 95 different genera and 45 different families. More information on the database coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Changelog
Compared to version 0.1.4, this version includes some changes in the metadata, but all time series data (sap flow, environmental) remain the same.
For all datasets, climate metadata (temperature and precipitation, ‘si_mat’ and ‘si_map’) have been extracted from CHELSA (https://chelsa-climate.org/), replacing the previous climate data obtained from WorldClim. This change has modified the biome classification of the datasets in ‘si_biome’.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) is now assigned a value of 0 if species are in the understorey. This affects two datasets: AUS_MAR_UBD and AUS_MAR_UBW, where, previously, the sum of species basal area percentages could add up to more than 100%.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) has been corrected for datasets USA_SIL_OAK_POS, USA_SIL_OAK_1PR, USA_SIL_OAK_2PR.
In ‘site’ metadata, the vegetation type (‘si_igbp’) has been changed to SAV for datasets CHN_ARG_GWD and CHN_ARG_GWS.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. The description and units of all data and metadata variables can be found here: Metadata and data units.
To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
library(sapfluxnetr)
vignette(package='sapfluxnetr')
vignette('metadata-and-data-units', package='sapfluxnetr')
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to (see the sketch below):
- Work with data from a single site: data access, plotting and time aggregation.
- Select the subset of datasets to work with.
- Work with data from multiple sites: data access, plotting and time aggregation.
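For example, a minimal sketch of reading and aggregating one dataset with sapfluxnetr; the function names follow the package vignettes, while the site code and folder path are placeholders to adapt to your local copy of the database:
# Minimal sketch, assuming the downloaded folder structure described above has been preserved
library(sapfluxnetr)
site <- read_sfn_data("ARG_TRE", folder = "RData/plant")   # placeholder site code and path
site_daily <- daily_metrics(site)   # aggregate sub-daily sap flow and environmental data to daily values
sfn_plot(site, type = "sapf")       # quick look at the sap flow time series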
Please check the following package vignettes to learn more about how to work with sfn_data files:
Quick guide
Metadata and data units
sfn_data classes
Custom aggregation
Memory and parallelization
Working with text files
We recommend working with the sfn_data objects using R and the sapfluxnetr package; we do not currently provide code to work with the text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Data errors that have been detected but not yet fixed in a released version will be published on the SAPFLUXNET main web page (‘Known data errors’).
Data access, use and citation
This version of the SAPFLUXNET database is open access and corresponds to the data paper submitted to Earth System Science Data in August 2020.
When using SAPFLUXNET data in an academic work, please cite the data paper, when available, or alternatively, the Zenodo dataset (see the ‘Cite as’ section on the right panels of this web page).
The high-frequency phone survey of refugees monitors the economic and social impact of, and responses to, the COVID-19 pandemic among refugees and nationals by calling a sample of households every four weeks. The main objective is to inform timely and adequate policy and program responses. Since the outbreak of the COVID-19 pandemic in Ethiopia, two rounds of data collection with refugees were completed between September and November 2020. The first round of the joint national and refugee HFPS was implemented between 24 September and 17 October 2020, and the second round between 20 October and 20 November 2020.
Household
Sample survey data [ssd]
The sample was drawn using a simple random sample without replacement. Expecting a high non-response rate based on experience from the HFPS-HH, we drew a stratified sample of 3,300 refugee households for the first round. More details on sampling methodology are provided in the Survey Methodology Document available for download as Related Materials.
Computer Assisted Telephone Interview [cati]
The Ethiopia COVID-19 High Frequency Phone Survey of Refugees questionnaire consists of the following sections:
A more detailed description of the questionnaire is provided in Table 1 of the Survey Methodology Document that is provided as Related Materials. The Round 1 and Round 2 questionnaires are available for download.
DATA CLEANING At the end of data collection, the raw dataset was cleaned by the Research team. This included formatting and correcting results based on monitoring issues, enumerator feedback and survey changes. The data cleaning carried out is detailed below.
Variable naming and labeling:
• Variable names were changed to reflect the lowercase question name in the paper survey copy, plus a word or two related to the question.
• Variables were labeled with longer descriptions of their contents, and the full question text was stored in Notes for each variable.
• “Other, specify” variables were named similarly to their related question, with “_other” appended to the name.
• Value labels were assigned where relevant, with options shown in English for all variables, unless preloaded from the roster in Amharic.
Variable formatting:
• Variables were formatted as their object type (string, integer, decimal, time, date, or datetime).
• Multi-select variables were saved both in space-separated single-variables and as multiple binary variables showing the yes/no value of each possible response.
• Time and date variables were stored as POSIX timestamp values and formatted to show Gregorian dates.
• Location information was left in separate ID and Name variables, following the format of the incoming roster. IDs were formatted to include only the variable level digits, and not the higher-level prefixes (2-3 digits only.)
• Only consented surveys were kept in the dataset, and all personal information and internal survey variables were dropped from the clean dataset.
• Roster data is separated from the main dataset and kept in long form, but can be merged on the key variable (the key can also be used to merge with the raw data).
• The variables were arranged in the same order as the paper instrument, with observations arranged according to their submission time.
Backcheck data review: Results of the backcheck survey are compared against the originally captured survey results using the bcstats command in Stata. This function delivers a comparison of variables and identifies any discrepancies. Any discrepancies identified are then examined individually to determine if they are within reason.
The following data quality checks were completed:
• Daily SurveyCTO monitoring: This included outlier checks, skipped questions, a review of “Other, specify” and other text responses, and enumerator comments. Enumerator comments were used to suggest new response options or to highlight situations where existing options should be used instead. Monitoring also included a review of variable relationship logic checks and checks of the logic of answers. Finally, outliers in phone variables such as survey duration or the percentage of time audio was at a conversational level were monitored. A survey duration of close to 15 minutes and a conversation-level audio percentage of around 40% were considered normal.
• Dashboard review: This included monitoring individual enumerator performance, such as the number of calls logged, duration of calls, percentage of calls responded to and percentage of non-consents. Non-consent reason rates and attempts per household were monitored as well. Duration analysis using R was used to monitor each module's duration and estimate the time required for subsequent rounds. The dashboard was also used to track overall survey completion and preview the results of key questions.
• Daily Data Team reporting: The Field Supervisors and the Data Manager reported daily feedback on call progress, enumerator feedback on the survey, and any suggestions to improve the instrument, such as adding options to multiple choice questions or adjusting translations.
• Audio audits: Audio recordings were captured during the consent portion of the interview for all completed interviews, for the enumerators' side of the conversation only. The recordings were reviewed for any surveys flagged by enumerators as having data quality concerns and for an additional random sample of 2% of respondents. A range of lengths was selected to observe edge cases. Most consent readings took around one minute, with some longer recordings due to questions on the survey or holding for the respondent. All reviewed audio recordings were completed satisfactorily.
• Back-check survey: Field Supervisors made back-check calls to a random sample of 5% of the households that completed a survey in Round 1. Field Supervisors called these households and administered a short survey, including (i) identifying the same respondent; (ii) determining the respondent's position within the household; (iii) confirming that a member of the data collection team had completed the interview; and (iv) a few questions from the original survey.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset describes the results of Portuguese bank marketing campaigns. The campaigns were based mostly on direct phone calls offering bank clients a term deposit. If, after all marketing efforts, the client agreed to place a deposit, the target variable is marked 'yes'; otherwise it is 'no'.
Source of the data: https://archive.ics.uci.edu/ml/datasets/bank+marketing
Citation Request:
This dataset is publicly available for research. The details are described in S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
Title: Bank Marketing (with social/economic context)
Sources Created by: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
Past Usage:
The full dataset (bank-additional-full.csv) was described and analyzed in:
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
Relevant Information:
This dataset is based on the "Bank Marketing" UCI dataset (please check the description at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The data is enriched by the addition of five new social and economic features/attributes (nationwide indicators from a country of roughly 10 million inhabitants), published by the Banco de Portugal and publicly available at https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in Moro et al., 2014. Using the rminer package and the R tool (http://cran.r-project.org/web/packages/rminer/), we found that the addition of the five new social and economic attributes (made available here) led to a substantial improvement in the prediction of success, even when the duration of the call is not included. Note: the file can be read in R using: d=read.table("bank-additional-full.csv",header=TRUE,sep=";")
The binary classification goal is to predict whether the client will subscribe to a bank term deposit (variable y).
Number of Instances: 41188 for bank-additional-full.csv
Number of Attributes: 20 + output attribute.
Attribute information:
For more information, read [Moro et al., 2014].
Input variables:
*1 - age (numeric)
*2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
*3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
*4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
5 - default: has credit in default? (categorical: "no","yes","unknown")
6 - housing: has housing loan? (categorical: "no","yes","unknown")
7 - loan: has personal loan? (categorical: "no","yes","unknown")
*9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
*10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
*11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
*12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
*13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
*14 - previous: number of contacts performed before this campaign and for this client (numeric)
*15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
*16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
*17 - cons.price.idx: consumer price index - monthly indicator (numeric)
*18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
*19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
Output variable (desired target): * 21 - y - h...
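For instance, a minimal baseline in base R (a plain logistic regression, not the rminer workflow used in the cited paper), reading the file as shown above and excluding 'duration' as recommended in the note on attribute 11:
# Minimal sketch: baseline logistic regression on bank-additional-full.csv
d <- read.table("bank-additional-full.csv", header = TRUE, sep = ";", stringsAsFactors = TRUE)
fit <- glm(y ~ . - duration, data = d, family = binomial)   # drop 'duration' for a realistic model
summary(fit)
p <- predict(fit, type = "response")       # predicted probabilities of subscribing to a term deposit
table(predicted = p > 0.5, actual = d$y)   # simple confusion table at a 0.5 cut-off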
Phase 1: ASK
1. Business Task * Cyclist is looking to increase its earnings and wants to know whether creating a social media campaign can influence "Casual" users to become "Annual" members.
2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote the bike-share program. The other teams involved with this project are Marketing & Analytics and the Executive Team.
3. Business Task: * Comparing the two kinds of users and defining how they use the platform, which variables they have in common, which variables differ, and how Casual users can be converted into Annual members.
Phase 2: PREPARE:
1. Determine Data Credibility * Cyclist provided data from years 2013-2021 (through March 2021), all of which is first-hand data collected by the company.
2. Sort & Filter Data: * The stakeholders want to know how the current users are using their service, so I am focusing on using the data from 2020-2021 since this is the most relevant period of time to answer the business task.
#Installing packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("readr", repos = "http://cran.us.r-project.org")
install.packages("janitor", repos = "http://cran.us.r-project.org")
install.packages("geosphere", repos = "http://cran.us.r-project.org")
install.packages("gridExtra", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(readr)
library(janitor)
library(geosphere)
library(gridExtra)
#Importing data & verifying the information within the dataset
all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
glimpse(all_tripdata_clean)
summary(all_tripdata_clean)
Phase 3: PROCESS
1. Cleaning Data & Preparing for Analysis: * Once the data has been placed into one dataset and checked for errors, we begin cleaning the data. * We eliminate data that corresponds to the company servicing the bikes, and any ride with a traveled distance of zero. * New columns will be added to assist in the analysis and to provide accurate assessments of who is using the bikes.
#Calculating the ride length in minutes first, since the filter below depends on it
all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at, all_tripdata_clean$started_at, units = "mins")
#Eliminating any data that represents the company performing maintenance ("HQ QR" station) and trips with a negative ride length
all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length < 0), ]
#Creating columns for the individual date components (date must be created before the components derived from it)
all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
all_tripdata_clean$day <- format(all_tripdata_clean$date, "%d")
all_tripdata_clean$month <- format(all_tripdata_clean$date, "%m")
all_tripdata_clean$year <- format(all_tripdata_clean$date, "%Y")
all_tripdata_clean$day_of_week <- format(all_tripdata_clean$date, "%A")
**Now I will calculate the distance traveled and the mean ride time and distance for each user group.**
#Calculating the ride distance in miles (ride_length in minutes was calculated above)
all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
all_tripdata_clean$ride_distance <- all_tripdata_clean$ride_distance / 1609.34 #converting metres to miles
#Calculating the mean time and distance by user group
userType_means <- all_tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(mean_time = mean(ride_length), mean_distance = mean(ride_distance))
Adding calculations that differentiate between bike types and show which type of user rides each specific bike type.
#Counting rides per user type, bike type and weekday (wday() comes from lubridate, installed with the tidyverse)
with_bike_type <- all_tripdata_clean %>% filter(rideable_type == "classic_bike" | rideable_type == "electric_bike")
with_bike_type %>%
  mutate(weekday = lubridate::wday(started_at, label = TRUE)) %>%
  group_by(member_casual, rideable_type, weekday) %>%
  summarise(totals = n(), .groups = "drop")
#Counting rides per user type and bike type
with_bike_type %>%
  group_by(member_casual, rideable_type) %>%
  summarise(totals = n(), .groups = "drop")
#Calculating the ride differential between user types by weekday
all_tripdata_clean %>%
  mutate(weekday = lubridate::wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), average_duration = mean(ride_length), .groups = 'drop') %>%
  arrange(me...