License: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** the link to the data artifacts is already included in the paper. The link to the code will be included in the camera-ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below.
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by ecosystem (PyPI).
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34 GB unpacked. This dataset still doesn't include the PyPI packages themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**).
- **Interview protocol.pdf** - approximate protocol used for the semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published.
- INSTALL.md - replication guide (~2 pages).
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2 TB of disk space (see the Step 2 detail levels)
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to the path of some newly created folder
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install Docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it is safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/init.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv` and go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
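Separately, once `survival_data.csv` has been obtained through any of the options above, a minimal R sketch along the following lines can load it for the modeling step handled by **build_model.r** (this is an illustration, not the authors' script; the column names in the commented-out model call are hypothetical placeholders):

    # A minimal sketch, not a copy of build_model.r
    library(survival)

    survival_data <- read.csv('survival_data.csv')
    str(survival_data)  # inspect the actual column names before modeling

    # Hypothetical column names, for illustration only:
    # fit <- coxph(Surv(time, dead) ~ commercial, data = survival_data)
    # summary(fit)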
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
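For analyses that need one row per film, the long table can be collapsed into a wide-style table in R. A minimal sketch, assuming hypothetical column names film_id and year alongside the documented fest variable (the actual names are defined in the codebook), which keeps the first sampled festival appearance per film, mirroring the wide table:

library(dplyr)
films_long <- read.csv("1_film-dataset_festival-program_long.csv")
films_wide <- films_long %>%
  arrange(film_id, year) %>%   # hypothetical column names; check the codebook
  group_by(film_id) %>%
  slice(1) %>%                 # keep the first sampled festival appearance per film
  ungroup()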
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
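A minimal R sketch of this style of two-method fuzzy title comparison, using the stringdist package (the example titles and similarity thresholds are illustrative and are not the values used in the original scripts):

library(stringdist)
title_core <- "The Example Film"   # hypothetical titles, for illustration only
title_imdb <- "Example Film, The"
sim_cosine <- stringsim(title_core, title_imdb, method = "cosine", q = 3)  # robust to word reordering
sim_osa    <- stringsim(title_core, title_imdb, method = "osa")            # robust to typos and minor edits
is_candidate <- sim_cosine > 0.9 | sim_osa > 0.8   # flag as a candidate match for manual checking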
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, festival categories, units of measurement, data sources, coding, and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row. This
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
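For readers without access to the PowerPoint slide, a minimal self-contained sketch of the same protocol is given below; the column names Replicate, Condition and Value follow Step 1, while the construction of the graph object is an assumption rather than a copy of the original script:

library(ggplot2)
data <- read.csv(file.choose())   # select the .csv file prepared in Step 1
graph <- ggplot(data, aes(x = Condition, y = Value))
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + theme_bw()
# For a log-scaled y-axis (Note 2), add scale_y_log10() as in the command above.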
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The present study updates and extends the meta-analysis by Haus et al. (2013), who applied the theory of planned behavior (TPB) to analyze gender differences in the motivation to start a business. We extend this meta-analysis by investigating the moderating role of the societal context in which the motivation to start a business emerges and proceeds. The results, based on 119 studies analyzing 129 samples with 266,958 individuals from 36 countries, show smaller gender differences than the original study and reveal little difference across cultural regions in the effects of the tested model. A meta-regression analyzing the role of specific cultural dimensions and economic factors on gender-related correlations reveals significant effects only for gender egalitarianism, and in the direction opposite to that expected. In summary, the study contributes to the discussion on gender differences, the importance of study replications and updates of meta-analyses, and the generalizability of theories across cultural contexts. Dataset for: Steinmetz, H., Isidor, R., & Bauer, C. (2021). Gender Differences in the Intention to Start a Business. Zeitschrift Für Psychologie, 229(1), 70–84. https://doi.org/10.1027/2151-2604/a000435: Electronic supplementary material D - Data file
By Ali Prasla [source]
The Online Retail Sales Dataset, often referred to as the Online Retail.csv file, is an extensive and comprehensive collection of data points relating to e-commerce transactions. This dataset provides a detailed view of sales activities within the online retail sector, covering numerous essential attributes necessary for a quantitative understanding of consumer behavior and the overall business performance.
One of the key elements covered in this dataset is 'InvoiceNo', which is a unique identifier for each transaction taking place in this retail environment. Given its uniqueness, it serves as a primary key for distinguishing individual transactions. It's worthwhile to note that these Invoice Numbers are numerical values.
Another important attribute included here is 'StockCode'. Each product listed or sold on this online retail platform has been assigned with its unique identification code or StockCode. These codes are also numerical values that offer another layer to clearly classify items and distinguish one from another.
For further understanding, every product comes with a basic description noted under the 'Description' column. In textual form, these descriptions provide insights into what exactly each product item entails. Aside from aiding identification efforts, they can potentially open avenues for text-based analysis such as sentiment analysis or keyword flagging based on product trends.
Moving on to details about the transactions themselves, we have two crucial columns: 'Quantity' and 'UnitPrice'. As their names suggest, these show, respectively, how many units of an item were sold per transaction and at what price per unit they were sold.
Further adding detail to our transaction information comes 'InvoiceDate', which records when each separate purchase occurred, down to the exact date and time. This data can be pivotal in recognizing sales patterns throughout different periods or predicting future trends based on historical timing behavior.
Finally yet importantly comes our global indicator: the ‘Country’ column specifies the country of residence of each customer who interacts with this online platform by making purchases. This gives us insight into the geographical dispersion of the user base across various countries, potentially revealing regional preferences or global market segmentation.
With such a wealth of detailed transaction records and customer information, the Online Retail.csv dataset stands as an invaluable tool for those looking to delve deep into online retail sales data analysis. The possibilities with this dataset are vast, ranging from shaping efficient marketing strategies based on geographical data to predicting sales and growth metrics using historical behavior, and much more.
Here's how to make best use of this dataset:
Getting Started

Before you start analyzing your data, you'll have to load it into statistical software such as Python (using the pandas library) or R. The dataset is saved in .csv format, which is easy to read into most data manipulation software.
Understand The Fields
InvoiceNo: Each transaction made has an associated unique numerical identifier called InvoiceNo. Consider it like a receipt code - these allow for tracking individual transactions.
StockCode: To identify each product uniquely during analysis, refer to each StockCode value which is essentially a product identification code.
Description: A brief textual description about each product that can be invaluable when dealing with categories for market-basket type analysis.
Quantity: Each row lists out how many units of a particular item were involved in a single transaction - watch out for very large values as they might represent bulk orders.
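A minimal sketch of loading the file in R and summarizing revenue by country, assuming the file name Online Retail.csv and the columns described above (negative Quantity values, if present, would indicate returns and may need filtering):

library(dplyr)
retail <- read.csv("Online Retail.csv", stringsAsFactors = FALSE)
revenue_by_country <- retail %>%
  mutate(Revenue = Quantity * UnitPrice) %>%                 # revenue per transaction line
  group_by(Country) %>%
  summarise(TotalRevenue = sum(Revenue, na.rm = TRUE),
            Transactions = n_distinct(InvoiceNo)) %>%
  arrange(desc(TotalRevenue))
head(revenue_by_country)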
License: https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.
In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.
Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in data sets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiments of posts and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv

| Column name | Description |
|:-------------------|:---------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| body | The body of the post. (String) |
| sentiment | The sentiment of the post. (String) |
| score | The score of the post. (Integer) |
File: the-reddit-dataset-dataset-posts.csv

| Column name | Description |
|:-------------------|:---------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| score | The score of the post. (Integer) |
| domain | The domain of the post. (String) |
| url | The URL of the post. (String) |
| selftext | The self-text of the post. (String) |
| title | The title of the post. (String) |
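As a minimal sketch of the kind of comparison suggested above, the posts file can be loaded in R and average scores compared across groups (column names as in the table above):

library(dplyr)
posts <- read.csv("the-reddit-dataset-dataset-posts.csv", stringsAsFactors = FALSE)
posts %>%
  group_by(subreddit.nsfw) %>%                       # or another grouping column of interest
  summarise(mean_score = mean(score, na.rm = TRUE),
            n_posts = n())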
If you use this dataset in your research, please credit SocialGrep.
This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has been used in undergraduate classrooms.
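A minimal sketch of the kind of workflow the modules cover, using a hypothetical file name and column names (the actual NEON tutorial data use their own field names):

library(ggplot2)
library(lubridate)
ts <- read.csv("temperature_daily.csv", stringsAsFactors = FALSE)   # hypothetical file and columns
ts$date <- ymd(ts$date)                                             # convert character dates to Date
ggplot(ts, aes(x = date, y = value)) +
  geom_line() +
  labs(x = "Date", y = "Daily value") +
  theme_bw()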
This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in the Buck Creek watershed near Inlet, New York, from 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow-normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations implemented through the Exploration and Graphics for River Trends R package (EGRET) to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021, where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files (“Daily.csv”, “INFO.csv”, “Sample_doc.csv”, and “Sample_nitrate.csv”) that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script (“Buck Creek WRTDS R script.R”) uses the four input datasets and functions from the EGRET R package to generate estimations of flow-normalized concentrations. The output file (“WRTDS_results.csv”) contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
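A minimal sketch of the WRTDS workflow with the EGRET package, using the input file names listed above (file locations and run parameters are assumptions; the provided R script is the authoritative version):

library(EGRET)
INFO   <- readUserInfo(".", "INFO.csv")          # site information
Daily  <- readUserDaily(".", "Daily.csv")        # daily mean discharge
Sample <- readUserSample(".", "Sample_doc.csv")  # DOC concentrations (or Sample_nitrate.csv)
eList <- mergeReport(INFO, Daily, Sample)        # combine into an EGRET eList
eList <- modelEstimation(eList)                  # fit the WRTDS model
plotConcHist(eList)                              # flow-normalized concentration history
tableResults(eList)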
This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient located along the lower montane life zone of a hillslope near the Pumphouse location at the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater in combination with data on groundwater chemistry (see related references) can be used to estimate rates of solute export from the hillslope to the floodplain and river. QA/QC analysis of measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated timestamp values, gap filling of missing timestamps and water levels, and removal of abnormal/bad values and outliers from the measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and the development of regular (5-minute, 1-hour, and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques, and will be applicable for ecohydrological modeling. The package includes a Readme file, one R code file used to perform QA/QC, a series of 8 data csv files (six QA/QC-ed regular time series datasets of varying intervals (5-min, 1-hr, 1-day) and two files with QA/QC flagging of original data), and three files for the reporting format adoption of this dataset (InstallationMethods, file level metadata (flmd), and data dictionary (dd) files). QA/QC-ed data herein were derived from the original/raw data publication available at Williams et al., 2020 (DOI: 10.15485/1818367). For more information about running the R code file (10.15485_1866836_QAQC_PLM1_PLM6.R) to reproduce the QA/QC output files, see the README (QAQC_PLM_readme.docx). This dataset replaces the previously published raw data time series, and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells is available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022). https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales.

2022/09/09 Update: Converted data files using ESS-DIVE’s Hydrological Monitoring Reporting Format. With the adoption of this reporting format, three new files (v1_20220909_flmd.csv, V1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at the PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv).
The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description. 2023/01/18 Update: Dataset updated to include additional QA/QC-ed water level data up until 2022-10-12 for ER-PLM1 and 2022-10-13 for ER-PLM6. Reporting format specific files (v2_20230118_flmd.csv, v2_20230118_dd.csv, v2_20230118_InstallationMethods.csv) were updated to reflect the additional data. R code file (QAQC_PLM1_PLM6.R) was added to replace the previously uploaded HTML files to enable execution of the associated code. R code file (QAQC_PLM1_PLM6.R) and ReadMe file (QAQC_PLM_readme.docx) were revised to clarify where original data was retrieved from and to remove local file paths.
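A minimal sketch of the kind of QA/QC aggregation described above, flagging duplicate timestamps and building an hourly series from the 5-minute record; the column names timestamp and water_level are hypothetical, and the actual files include metadata header rows, so a skip argument may be needed (see the data dictionary and readme for the real layout):

library(dplyr)
library(lubridate)
wl <- read.csv("er_plm1_waterlevel_2016-2020.csv", stringsAsFactors = FALSE)
wl$timestamp <- ymd_hms(wl$timestamp)                       # hypothetical column name
wl_hourly <- wl %>%
  filter(!duplicated(timestamp)) %>%                        # drop duplicated timestamps
  group_by(hour = floor_date(timestamp, "hour")) %>%
  summarise(water_level = mean(water_level, na.rm = TRUE))  # hourly mean water level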
Welcome to my Kickstarter case study! In this project I’m trying to understand what the success factors for a Kickstarter campaign are, analyzing a publicly available dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.
ASK
Different questions will guide my analysis:
1. Does the campaign duration influence the success of the project?
2. Does the chosen funding budget influence it?
3. Which category of campaign is the most likely to be successful?
PREPARE
I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month, and all the data are divided into CSV files. Each table contains:
- backers_count : number of people that contributed to the campaign
- blurb : a captivating text description of the project
- category : the label categorizing the campaign (technology, art, etc)
- country
- created_at : day and time of campaign creation
- deadline : day and time of campaign max end
- goal : amount to be collected
- launched_at : date and time of campaign launch
- name : name of campaign
- pledged : amount of money collected
- state : success or failure of the campaign
Each month's scraping produces a huge amount of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month was placed in its own folder.
Having a first look at the spreadsheets, it is clear that some cleaning and modification is needed: for example, dates and times are stored as Unix timestamps, there are multiple columns that are not helpful for the scope of my analysis, and currencies need to be standardized (some are US$, some GB£, etc.). In general, I have all the data I need to answer my initial questions, identify trends, and make predictions.
PROCESS
I decided to use R to clean and process the data. For each month I started by setting up a new working environment in its own folder. After loading the necessary libraries:
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)
I scripted a general R routine that searches for CSV files in the folder, opens each one as a separate variable, and collects them into a list of data frames:
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- list()
for (file in csv_files) {
variable_name <- sub("\\.csv$", "", file)
assign(variable_name, read.csv(file))
data_frames[[variable_name]] <- get(variable_name)
}
Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.
data_frames <- lapply(data_frames, function(df) {
df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
return(df)
})
data_frames <- lapply(data_frames, function(df) {
df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
return(df)
})
data_frames <- lapply(data_frames, function(df) {
df$usd_pledged <- as.numeric(df$usd_pledged)
return(df)
})
In each folder I then ran a command to merge the CSVs into a single file (one for November 2023, one for December 2023 and one for January 2024):
all_nov_2023 = bind_rows(data_frames)
all_dec_2023 = bind_rows(data_frames)
all_jan_2024 = bind_rows(data_frames)
After merging, I converted the Unix timestamps into readable datetimes for the columns “created”, “launched”, and “deadline”, and deleted all the rows that had any of these values set to 0. I also extracted the category slug from the “category” column, dropping information unnecessary for the scope of my analysis. The final table was then saved.
filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>%
filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%
mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>%
mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%
mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
select(-category, -deadline, -launched_at, -created_at) %>%
relocate(created, launched, setted_deadline, .before = goal)
write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
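One step not shown above is the currency standardization mentioned earlier. A minimal sketch of one possible approach, converting the goal and pledged amounts to USD with the usd_exchange_rate column (an assumption about how that rate is oriented, which should be checked against the data):

filtered_dec_2023 <- filtered_dec_2023 %>%
  mutate(goal_usd = goal * usd_exchange_rate,        # assumed: rate converts the original currency to USD
         pledged_usd = pledged * usd_exchange_rate)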
The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
This data release comprises the data files and code necessary to perform all analyses presented in the associated publication. The *.csv data files are aggregations of water extent on the basis of the European Commission's Joint Research Centre (JRC) Monthly Water History database (v1.0) and the Dynamic Surface Water Extent (DSWE) algorithm. The shapefile dataset contains the study area 8-digit hydrologic unit code (HUC) regions used as the basis for analysis. HTML files provide an overview of the study workflow and integrated R notebooks (in .Rmd format) for recreating all project results and plots. The R notebooks ingest the necessary data files from their online locations. These data support the following publication: Walker JJ, Soulard CE, Petrakis RE. In press. Integrating stream gage data and Landsat imagery to complete time-series of surface water extents in Central Valley, California. International Journal of Applied Earth Observation and Geoinformation, http://dx.doi.org/xx.xxxxx/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as per the details listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
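A minimal sketch of the shuffling step recommended above, assuming the two columns reviews and labels described earlier (the seed is arbitrary):

rt <- read.csv("data_rt.csv", stringsAsFactors = FALSE)
set.seed(42)                    # arbitrary seed, for reproducibility
rt <- rt[sample(nrow(rt)), ]    # shuffle rows before any train/test split
table(rt$labels)                # should show 5,331 positive and 5,331 negative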
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, in Unix time), reviewTime (time of the review, raw), and division (manually added categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
License: https://creativecommons.org/publicdomain/zero/1.0/
I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.
Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.
I used the script noted below to clean the files and then added some additional steps to create the visualizations to complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.
Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html
Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy
Note: the combined data set has 3,876,042 rows, so you will likely need to run the R analysis on your computer (e.g., the R console) rather than in the cloud (e.g., RStudio Cloud).
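A minimal sketch of combining the four quarterly files locally; data.table's fread is one reasonable choice at this size, and fill = TRUE tolerates differing column sets across quarters (they still need to be reconciled, as the cleaning script does):

library(data.table)
files <- c("Divvy_Trips_2019_Q2.csv", "Divvy_Trips_2019_Q3.csv",
           "Divvy_Trips_2019_Q4.csv", "Divvy_Trips_2020_Q1.csv")
all_trips <- rbindlist(lapply(files, fread), fill = TRUE)   # read and stack all quarters
nrow(all_trips)                                             # roughly 3.9 million rows expected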
This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.
One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.
I am also interested to see what others did with this analysis - what were the findings and insights you found?
This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data you probably do not want to download all of the files. Depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, to loading the datasets, running the analysis, and building the intermediate datasets.

Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive. On a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS. For example, newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models.
See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives. On a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets (building the intermediate files): The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z; on a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install R dependencies: in R run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The DIAMAS project investigates Institutional Publishing Service Providers (IPSP) in the broadest sense, with a special focus on those publishing initiatives that do not charge fees to authors or readers. To collect information on Institutional Publishing in the ERA, a survey was conducted among IPSPs between March-May 2024. This dataset contains aggregated data from the 685 valid responses to the DIAMAS survey on Institutional Publishing.
The dataset supplements D2.3 Final IPSP landscape Report Institutional Publishing in the ERA: results from the DIAMAS survey.
The data
Basic aggregate tabular data
Full individual survey responses are not being shared, to prevent the easy identification of respondents (in line with conditions set out in the survey questionnaire). This dataset contains full tables with aggregate data for all questions from the survey, with the exception of free-text responses, from all 685 survey respondents. This includes, per question, overall totals and percentages for the answers given, as well as the breakdown by both IPSP types: institutional publishers (IPs) and service providers (SPs). Tables at country level have not been shared, as cell values often turned out to be too low to prevent potential identification of respondents. The data is available in csv and docx formats, with csv files grouped and packaged into ZIP files. Metadata describing data type, question type, as well as question response rate, is available in csv format. The R code used to generate the aggregate tables is made available as well.
Files included in this dataset
survey_questions_data_description.csv - metadata describing data type, question type, as well as question response rate per survey question.
tables_raw_all.zip - raw tables (csv format) with aggregated data per question for all respondents, with the exception of free-text responses. Questions with multiple answers have a table for each answer option. Zip file contains 180 csv files.
tables_raw_IP.zip - as tables_raw_all.zip, for responses from institutional publishers (IP) only. Zip file contains 180 csv files.
tables_raw_SP.zip - as tables_raw_all.zip, for responses from service providers (SP) only. Zip file contains 170 csv files.
tables_formatted_all.docx - formatted tables (docx format) with aggregated data per question for all respondents, with the exception of free-text responses. Questions with multiple answers have a table for each answer option.
tables_formatted_IP.docx - as tables_formatted_all.docx, for responses from institutional publishers (IP) only.
tables_formatted_SP.docx - as tables_formatted_all.docx, for responses from service providers (SP) only.
DIAMAS_Tables_single.R - R script used to generate raw tables with aggregated data for all single response questions
DIAMAS_Tables_multiple.R - R script used to generate raw tables with aggregated data for all multiple response questions
DIAMAS_Tables_layout.R - R script used to generate document with formatted tables from raw tables with aggregated data
DIAMAS Survey on Institutional Publishing - data availability statement (pdf)
All data are made available under a CC0 license.
This child page contains a zipped folder which contains all items necessary to run trend models and produce results published in U.S. Geological Survey Scientific Investigations Report 2021–XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p.]. To run the R-QWTREND program in R, 6 files are required and each is included in this child page: prepQWdataV4.txt, runQWmodelV4XXUEP.txt, plotQWtrendV4XXUEP.txt, qwtrend2018v4.exe, salflibc.dll, and StartQWTrendV4.R (Vecchia and Nustad, 2020). The folder contains: six items required to run the R–QWTREND trend analysis tool; a readme.txt file; a flowtrendData.RData file; an allsiteinfo.table.csv file; a folder called "scripts"; and a folder called "waterqualitydata". The "scripts" folder contains the scripts that can be used to reproduce the results found in the USGS Scientific Investigations Report referenced above. The "waterqualitydata" folder contains .csv files with the naming convention of site_ions or site_nuts for major ion and nutrient constituents and contains machine-readable files with the water-quality data used for the trend analysis at each site. R–QWTREND is a software package for analyzing trends in stream-water quality. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for using R–QWTREND: • Windows 10 operating system • R (version 3.4 or later; 64 bit recommended) • RStudio (version 1.1.456 or later). An accompanying report (Vecchia and Nustad, 2020) serves as the formal documentation for R–QWTREND. Vecchia, A.V., and Nustad, R.A., 2020, Time-series model, statistical methods, and software documentation for R–QWTREND—An R package for analyzing trends in stream-water quality: U.S. Geological Survey Open-File Report 2020–1014, 51 p., https://doi.org/10.3133/ofr20201014. R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
License: https://creativecommons.org/publicdomain/zero/1.0/
Obtaining all types of data (Numerical, Temporal, Image, Categorical, CSV, DICOM) in a short and malleable format for quick and easy use was something that I, as a learner, wished I had. The huge and complex nature of publicly available datasets was sometimes too intimidating for beginners, and for professionals who just want to do a quick sanity check of their algorithm on another dataset. So this dataset aims to solve exactly that problem.
The **Diverse Algorithms Analysis Dataset (DAAD)** contains several different types of datasets, all grouped into one for easy access by a learner. It contains concise and well-documented data to help you jump-start your implementations of algorithms.
A user can work with this dataset in several ways. The dataset is intended to be dynamic, but the current version contains the following:
Pokemon_categorical: A CSV file that contains information about every Pokemon, largely as categorical features. Attributes such as abilities, attack, defense, points, etc. are present. The objective is to predict whether a Pokemon is legendary or not, a typical binary classification problem (a minimal sketch follows this list).
Pokemon_numerical: A CSV file similar to Pokemon_categorical but with fewer categorical features and more emphasis on numeric scores such as points, HP, Generation, attack, special attack, and defense. The objective is again binary classification of whether a Pokemon is legendary or not.
Stock_forecasting: A CSV file that contains the stock price of a multinational company over a continuous rolling two-year period. Ideal for beginners getting started with stock prediction and for training simple to complex regression models. The best results are typically obtained with sequence models such as RNNs, LSTMs, or GRUs.
Temperatures_3_years: A CSV file that contains the daily minimum temperatures of a city recorded over a rolling three-year period. The objective can be modeled according to user needs: you may choose to predict the temperatures for the next month or make day-wise predictions. This dataset works very well with LSTMs and shows considerable performance with boosting algorithms.
License plate number detection: This dataset contains about 120 training and 50 test images (a compact version of a larger dataset) of car number plates. The user can try out ROI pooling, image localization, and detection techniques, along with applying OCR to the dataset. The small size of the dataset helps you train faster and generalize more easily. Ideal for beginners in computer vision.
University_Recruitment_Data: This contains information covering a student's bio-data and credentials, including work experience, degree percentage, and other relevant factors. The objective is a simple binary classification problem: whether the student will be recruited or not.
(to be contd...)
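For readers who want a concrete starting point, here is a minimal sanity-check sketch for the Pokemon classification task described above. The file name, the `Legendary` target column, and the assumption that the target is stored as boolean or 0/1 values are illustrative and may need adjusting to the actual files:

```python
# Minimal sanity-check sketch for the Pokemon binary classification task.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed file and column names; adjust to the actual dataset.
df = pd.read_csv("Pokemon_categorical.csv")

y = df["Legendary"].astype(int)                       # assumes a boolean/0-1 target
X = pd.get_dummies(df.drop(columns=["Legendary"]))    # one-hot encode the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```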
As mentioned above, it would have been a valuable resource for me to have a dataset like this, where I could train and deploy my models with relative ease and worry less about scavenging through several data sources. I intend DAAD to be a repository that serves the needs of all types of ML enthusiasts and developers. I would also appreciate contributions from fellow Kagglers to enrich this dataset, making it accessible to all and ideal for simple, quick implementations without losing the reliability of larger datasets.
Have fun!
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., Florence Habets, David R. Maidment and Zong-Liang Yang (2011), RAPID applied to the SIM-France model, Hydrological Processes, 25(22), 3412-3425. DOI: 10.1002/hyp.8070.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
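As a concrete illustration of this convention, the sketch below parses the two ISO 8601 timestamps that delimit a range and counts how many 3-hourly time steps the range contains; the helper function is purely illustrative and not part of the dataset:

```python
# Count the 3-hourly time steps contained in an ISO 8601 time range,
# following the convention described above (first time = start of the
# first step, second time = end of the last step).
from datetime import datetime, timedelta

def count_steps(start_iso, end_iso, step_hours=3):
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return int((end - start) / timedelta(hours=step_hours))

# The example range from the text contains two 3-hourly steps.
print(count_steps("2000-01-01T03:00+00:00", "2000-01-01T09:00+00:00"))  # 2
```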
Data sources
The following sources were used to produce files in this dataset:
The hydrographic network of SIM-France, as published in Habets, F., A. Boone, J. L. Champeaux, P. Etchevers, L. Franchistéguy, E. Leblois, E. Ledoux, P. Le Moigne, E. Martin, S. Morel, J. Noilhan, P. Quintana Seguí, F. Rousset-Regimbeau, and P. Viennot (2008), The SAFRAN-ISBA-MODCOU hydrometeorological model applied over France, Journal of Geophysical Research: Atmospheres, 113(D6), DOI: 10.1029/2007JD008548.
The observed flows are from Banque HYDRO, Service Central d'Hydrométéorologie et d'Appui à la Prévision des Inondations. Available at http://www.hydro.eaufrance.fr/index.php.
Outputs from a simulation using SIM-France (Habets et al. 2008). The simulation was run by Florence Habets, and produced 3-hourly time steps from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software was used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.1.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to one study domain:
The river network of SIM-France is made of 24,264 river reaches. The temporal range corresponding to this domain is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00.
Description of files
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_France.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of the SIM-France river reaches (the IDs). For each river reach, this file specifies: the ID of the reach, the ID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the IDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of ID. The values were computed based on the SIM-France FICVID file. This file was prepared using a Fortran program.
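A sketch of how this connectivity file can be read, assuming the layout described above (no header row; reach ID, downstream ID, upstream count, and up to four upstream IDs); the column names are illustrative:

```python
# Read the river network connectivity table described above.
# The file is assumed to have no header row; column names here are
# illustrative. Layout assumed from the description: reach ID,
# downstream reach ID, number of upstream reaches, and up to four
# upstream reach IDs (zero is used in place of NoData).
import pandas as pd

cols = ["id", "id_down", "n_up", "id_up1", "id_up2", "id_up3", "id_up4"]
connect = pd.read_csv("rapid_connect_France.csv", header=None, names=cols)

print(len(connect), "river reaches")  # expected: 24264
print(connect.head())
```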
m3_riv_France_1995_2005_ksat_201101_c_zvol_ext.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The time range for this file is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. The values were computed using the outputs of SIM-France. This file was prepared using a Fortran program.
kfac_modcou_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, Equation (5) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_modcou_ttra_length.csv. This CSV file contains a second guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, travel time, and Equation (9) in David et al. (2011).
k_modcou_0.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_a.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_b.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_c.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_0.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_a.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_b.csv. This CSV file contains Muskingum x values
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., David R. Maidment, Guo-Yue Niu, Zong-Liang Yang, Florence Habets and Victor Eijkhout (2011), River Network Routing on the NHDPlus Dataset, Journal of Hydrometeorology, 12(5), 913-934. DOI: 10.1175/2011JHM1345.1.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The National Hydrography Dataset Plus (NHDPlus) Version 1, obtained from http://www.horizon-systems.com/nhdplus.
The National Water Information System (NWIS), obtained from http://waterdata.usgs.gov/nwis.
Outputs from a simulation using the community Noah land surface model with multiparameterization options (Noah-MP, Niu et al. 2011, http://www.jsg.utexas.edu/noah-mp). The simulation was run by Guo-Yue Niu, and produced 3-hourly time steps from 2004-01-01T00:00+00:00 to 2008-01-01T00:00+00:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software was used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.0.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
CUAHSI HydroGET (http://his.cuahsi.org/hydroget.html).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to two study domains:
The combination of the San Antonio and Guadalupe River Basins, TX. RAPID can only use the river reaches of NHDPlus that have a known flow direction, and only these reaches (a total of 5,175) are considered here. The temporal range corresponding to this domain is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00.
The Upper Mississippi River Basin. RAPID can only use the river reaches of NHDPlus that have a known flow direction, and only these reaches (a total of 182,240) are considered here. The temporal range corresponding to this domain spans 100 fictitious days.
Description of files for the San Antonio and Guadalupe River Basins
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_San_Guad.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of NHDPlus reaches (the COMIDs). For each river reach, this file specifies: the COMID of the reach, the COMID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the COMIDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of COMID. The values were computed using a combination of the following NHDPlus fields: COMID, DIVERGENCE, FROMNODE and TONODE. This file was prepared using ArcGIS and Excel.
m3_riv_San_Guad_2004_2007_cst.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T18:00-06:00. The values were computed by superimposing a 900-m gridded map of NHDPlus catchments to the outputs of Noah-MP. This file was prepared using ArcGIS and a Fortran program.
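Before using this file, it can be helpful to inspect its structure; since variable names are not specified in this description, the sketch below simply lists whatever dimensions and variables the file contains (it assumes the netCDF4 Python package is installed):

```python
# Inspect the structure of the 3-hourly runoff inflow file.
from netCDF4 import Dataset  # pip install netCDF4

with Dataset("m3_riv_San_Guad_2004_2007_cst.nc") as nc:
    print("dimensions:", {name: len(dim) for name, dim in nc.dimensions.items()})
    print("variables:", list(nc.variables))
```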
kfac_San_Guad_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_San_Guad_celerity.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using the wave celerity numbers of Table 2 in David et al. (2011). This file was prepared using a Fortran program.
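If, as these descriptions suggest, the first guess of k is essentially reach length divided by wave celerity, it can be approximated along the following lines. This is a sketch of that relationship, not a transcription of Equation (13) in David et al. (2011):

```python
# Sketch of a first-guess Muskingum k: reach length divided by wave
# celerity, as suggested by the descriptions above. This approximates
# the idea behind Equation (13) in David et al. (2011) but is not a
# transcription of it.
def kfac_seconds(length_km, celerity_km_per_h=1.0):
    return length_km / celerity_km_per_h * 3600.0  # hours -> seconds

# A 10 km reach with a 1 km/h wave celerity gives k = 36,000 s (10 h).
print(kfac_seconds(10.0))
```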
k_San_Guad_2004_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
basin_id_San_Guad_hydroseq.csv. This CSV file contains the list of unique IDs of NHDPlus river reaches (COMID) in the San Antonio and Guadalupe River Basins. The river reaches are sorted from upstream to downstream. The values were computed using the following NHDPlus fields: COMID and HYDROSEQ. This file was prepared using Excel.
Qout_San_Guad_1460days_p1_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (17) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p2_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (18) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
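To extract the simulated discharge time series for a single reach from one of these output files, something like the sketch below can work. The variable names ("COMID" and "Qout") and the (time, reach) array orientation are assumptions to verify against the file itself:

```python
# Extract the simulated discharge time series for one river reach
# from a RAPID output file. Variable names and array orientation are
# assumptions; check list(nc.variables) first.
import numpy as np
from netCDF4 import Dataset  # pip install netCDF4

reach_of_interest = 1619595  # hypothetical COMID; replace with a real one

with Dataset("Qout_San_Guad_1460days_p1_dtR=900s.nc") as nc:
    comids = nc.variables["COMID"][:]          # assumed variable name
    index = int(np.where(comids == reach_of_interest)[0][0])
    qout = nc.variables["Qout"][:, index]      # assumes (time, reach) order

print(qout.shape[0], "3-hourly averaged discharge values (m3/s)")
```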
Qout_San_Guad_1460days_p3_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file with the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample passing various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
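The published analysis performs this extract-and-combine step in R (Identifying_Recombinant_Reads.Rmd); purely as an illustration, a Python sketch of the same step might look like the following, with the archive naming and internal layout assumed from the description above:

```python
# Illustrative alternative (the published analysis does this step in R):
# pull dUMI_ranked.csv out of every tagged_*.tar.gz archive and stack
# them into one table written to dUMI_df.csv.
import glob
import tarfile

import pandas as pd

frames = []
# Archive names assumed to follow the tagged_<dataset>.tar.gz convention.
for archive in glob.glob("Pipeline_Outputs/tagged_*.tar.gz"):
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if member.name.endswith("dUMI_ranked.csv"):
                frame = pd.read_csv(tar.extractfile(member))
                frame["source_archive"] = archive
                frames.append(frame)

dUMI_df = pd.concat(frames, ignore_index=True)
dUMI_df.to_csv("dUMI_df.csv", index=False)
```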
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.