11 datasets found
  1. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system, run tar xf code.tar). Navigate to code/paper_source. Install R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS function, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

    Generating datasets. Building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
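    For readers who prefer to inspect the intermediate datasets outside of R, a minimal sketch using pandas is shown below; the file name newcomers.tab simply mirrors the newcomers.RDS example above and is only illustrative, since the exact .tab exports and their columns are not listed in this description.

     # Minimal sketch: load one of the tab-separated intermediate files with pandas.
     # "newcomers.tab" is illustrative; substitute whichever .tab file you downloaded.
     import pandas as pd

     newcomers = pd.read_csv("newcomers.tab", sep="\t")

     # Quick sanity checks on whatever columns the export actually contains.
     print(newcomers.shape)
     print(newcomers.dtypes)
     print(newcomers.head())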

  2. riiid_train_converted to Multiple Formats

    • kaggle.com
    Updated Jun 2, 2021
    Cite
    Santh Raul (2021). riiid_train_converted to Multiple Formats [Dataset]. https://www.kaggle.com/santhraul/riiid-train-converted-to-multiple-formats/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Santh Raul
    Description

    Context

    The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into the Kaggle Notebook's RAM when loaded with the default pandas read_csv, which prompted a search for alternative approaches and formats.

    Content

    Train data of Riiid competition in different formats.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Reading the .CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .CSV into different file formats so that they can be loaded easily into a Kaggle kernel.
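    As an illustration of the conversion described above, the sketch below reads the CSV with reduced-memory dtypes and writes it back out in a columnar format; the column names and dtypes shown are placeholders rather than the actual Riiid schema, and pyarrow is assumed to be installed for Parquet support.

     import pandas as pd

     # Sketch only: read a large CSV with explicit dtypes to keep memory down,
     # then save it once in a columnar format so later loads are fast.
     # Column names/dtypes are placeholders, not the actual Riiid schema.
     dtypes = {"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"}
     train = pd.read_csv("train.csv", dtype=dtypes, usecols=list(dtypes))

     train.to_parquet("riiid_train.parquet")         # requires pyarrow
     train = pd.read_parquet("riiid_train.parquet")  # fast re-load in later sessions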

  3. OES_RSI Dataset

    • figshare.com
    docx
    Updated Aug 3, 2020
    Cite
    Tricia Salzar; Mark Benden (2020). OES_RSI Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12753545.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Aug 3, 2020
    Dataset provided by
    figshare
    Authors
    Tricia Salzar; Mark Benden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data was collected from a large company as part of a corporate wellness program. The data were provided by Cority Enviance and collected via their data logging software RSIGuard and Remedy's Interactive Workplace Injury Prevention Program. The data is located in the CSV file and contains measures of discomfort and computer utilization for the three years of interest. Data cover 2012-2015 and include three computer use records per participant (28, 91 and 364 days) and two discomfort and workstation measurements per participant. The data is de-identified (includes a study ID) and contains no demographics for participants. Variable descriptions are included in the separate Word file.

  4. Data from: Data files used to study the distribution of growth in software...

    • researchdata.edu.au
    Updated May 4, 2011
    Cite
    Swinburne University of Technology (2011). Data files used to study the distribution of growth in software systems [Dataset]. https://researchdata.edu.au/files-used-study-software-systems/14865
    Explore at:
    Dataset updated
    May 4, 2011
    Dataset provided by
    Swinburne University of Technology
    Description

    The evolution of a software system can be studied in terms of how various properties as reflected by software metrics change over time. Current models of software evolution have allowed for inferences to be drawn about certain attributes of the software system, for instance, regarding the architecture, complexity and its impact on the development effort. However, an inherent limitation of these models is that they do not provide any direct insight into where growth takes place. In particular, we cannot assess the impact of evolution on the underlying distribution of size and complexity among the various classes. Such an analysis is needed in order to answer questions such as 'do developers tend to evenly distribute complexity as systems get bigger?', and 'do large and complex classes get bigger over time?'. These are questions of more than passing interest since by understanding what typical and successful software evolution looks like, we can identify anomalous situations and take action earlier than might otherwise be possible. Information gained from an analysis of the distribution of growth will also show if there are consistent boundaries within which a software design structure exists. In our study of metric distributions, we focused on 10 different measures that span a range of size and complexity measures. The raw metric data (4 .txt files and 1 .log file in a .zip file measuring ~0.5MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).

  5. AV : Healthcare Analytics

    • kaggle.com
    zip
    Updated Sep 13, 2020
    Cite
    shivan kumar (2020). AV : Healthcare Analytics [Dataset]. https://www.kaggle.com/shivan118/healthcare-analytics
    Explore at:
    Available download formats: zip (1591838 bytes)
    Dataset updated
    Sep 13, 2020
    Authors
    shivan kumar
    Description

    Context

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides them the facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years and they see a high drop off between “Registration” and the Number of people taking tests at the Camps. In the last 4 years, they have stored data of ~110,000 registrations they have done.

    One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than the required inventory, you incur unnecessarily high costs. On the other hand, if you carry less than the required inventory for conducting these medical checks, people end up having a bad experience.

    The Process:

    1. MedCamp employees/volunteers reach out to people and drive registrations.
    2. During the camp, People who “ShowUp” either undergo the medical tests or visit stalls depending on the format of the health camp.

    Other things to note:

    • Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    • For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
    • MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

    Favorable outcome:

    • For the first 2 formats, a favorable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    • You need to predict the chances (probability) of having a favorable outcome.

    Data Description

    Train.zip contains the following 6 csv files, alongside the data dictionary that contains definitions for each variable:

    Health_Camp_Detail.csv – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

    Train.csv – File containing registration details for all the train camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

    Patient_Profile.csv – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

    First_Health_Camp_Attended.csv – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.

    Second_Health_Camp_Attended.csv - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.

    Third_Health_Camp_Attended.csv - This file contains details about people who attended health camp of third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.

    Test Set

    Test.csv – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

    Train / Test split:

    Camps that started on or before 31st March 2006 are considered in Train. Test data is for all camps conducted on or after 1st April 2006.

    Sample Submission:

    Patient_ID: Unique Identifier for each patient. This ID is not sequential in nature and can not be used in modeling

    Health_Camp_ID: Unique Identifier for each camp. This ID is not sequential in nature and can not be used in modeling

    Outcome: Predicted probability of a favorable outcome.

    Evaluation Metric

    The evaluation metric for this hackathon is ROC-AUC Score.
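    A hedged sketch of how the favorable-outcome label could be assembled from the files above is shown below; the file and column names follow the description, but the merge logic is an assumption rather than an official solution, and sklearn is used only to illustrate the ROC-AUC metric.

     import pandas as pd
     from sklearn.metrics import roc_auc_score

     # Sketch only: build a binary "favorable outcome" label per registration,
     # following the definition above. File and column names come from the
     # description; the merge details are an assumption.
     train = pd.read_csv("Train.csv")
     first = pd.read_csv("First_Health_Camp_Attended.csv")    # has Health_Score
     second = pd.read_csv("Second_Health_Camp_Attended.csv")  # has Health_Score
     third = pd.read_csv("Third_Health_Camp_Attended.csv")    # has Number_of_stall_visited

     keys = ["Patient_ID", "Health_Camp_ID"]
     favorable = pd.concat([
         first[keys],                                             # got a health score
         second[keys],                                            # got a health score
         third.loc[third["Number_of_stall_visited"] >= 1, keys],  # visited at least one stall
     ]).drop_duplicates()
     favorable["Outcome"] = 1

     train = train.merge(favorable, on=keys, how="left")
     train["Outcome"] = train["Outcome"].fillna(0).astype(int)

     # A constant-probability baseline scores 0.5 on the hackathon metric (ROC-AUC):
     baseline = [train["Outcome"].mean()] * len(train)
     print(roc_auc_score(train["Outcome"], baseline))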

  6. Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a...

    • data.niaid.nih.gov
    Updated Mar 17, 2025
    Cite
    Geroliminis, Nikolas (2025). Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a Large-Scale Study in a Smart City [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13828383
    Explore at:
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Fonod, Robert
    Geroliminis, Nikolas
    Yeo, Hwasoo
    Cho, Haechan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Songdo-dong
    Description

    Overview

    The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured through high-altitude bird's-eye view (BeV) drone footage over Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, this resource represents one of the most extensive aerial traffic datasets publicly available, distinguishing itself through exceptional temporal resolution that captures vehicle movements at 29.97 points per second, enabling unprecedented granularity for advanced urban mobility analysis.

    ⚠️ Important: If you use this dataset in your work, please cite the following reference [1]:

    Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, arXiv preprint arXiv:2411.02136.

    (Note: This manuscript shall be replaced by the published version once available.)

    Dataset Composition

    The dataset consists of four primary components:

    Trajectory Data: 80 ZIP archives containing high-resolution vehicle trajectories with georeferenced positions, speeds and acceleration profiles, and other metadata.

    Orthophoto Cut-Outs: High-resolution (8000×8000 pixel) orthophoto images for each monitored intersection, used for georeferencing and visualization.

    Road and Lane Segmentations: CSV files defining lane polygons within road sections, facilitating mapping of vehicle positions to road segments and lanes.

    Sample Videos: A selection of 4K UHD drone video samples capturing intersection footage during the experiment.

    Data Collection

    The dataset was collected as part of a collaborative multi-drone experiment conducted by KAIST and EPFL in Songdo, South Korea, from October 4–7, 2022.

    A fleet of 10 drones monitored 20 busy intersections, executing advanced flight plans to optimize coverage.

    4K (3840×2160) RGB video footage was recorded at 29.97 FPS from altitudes of 140–150 meters.

    Each drone flew 10 sessions per day, covering peak morning and afternoon periods.

    The experiment resulted in 12TB of 4K raw video data.

    More details on the experimental setup and data processing pipeline are available in [1].

    Data Processing

    The trajectories were extracted using geo-trax, an advanced deep learning framework designed for high-altitude UAV-based traffic monitoring. This state-of-the-art pipeline integrates vehicle detection, tracking, trajectory stabilization, and georeferencing to extract high-accuracy traffic data from drone footage.

    Key Processing Steps:

    Vehicle Detection & Tracking: Vehicles were detected and tracked across frames using a deep learning-based detector and motion-model-based tracking algorithm.

    Trajectory Stabilization: A novel track stabilization method was applied using detected vehicle bounding boxes as exclusion masks in image registration.

    Georeferencing & Coordinate Transformation: Each trajectory was transformed into global (WGS84), local Cartesian, and orthophoto coordinate systems.

    Vehicle Metadata Estimation: In addition to time-stamped vehicle trajectories, various metadata attributes were also extracted, including vehicle dimensions and type, speed, acceleration, class, lane number, road section, and visibility status.

    More details on the extraction methodology are available in [1].

    File Structure & Formats

    1. Trajectory Data (Daily Intersection ZIPs, 16.2 MB ~ 360.2 MB)

    The trajectory data is organized into 80 ZIP files, each containing traffic data for a specific intersection and day of the experiment.

    File Naming Convention:

    YYYY-MM-DD_intersectionID.zip

    YYYY-MM-DD represents the date of data collection (2022-10-04 to 2022-10-07).

    intersectionID is a unique identifier for one of the 20 intersections where data was collected (A, B, C, E, …, U). The letter D is reserved to denote "Drone".

    Each ZIP file contains 10 CSV files, each corresponding to an individual flight session:

    YYYY-MM-DD_intersectionID.zip
    │── YYYY-MM-DD_intersectionID_AM1.csv
    ├── …
    │── YYYY-MM-DD_intersectionID_AM5.csv
    │── YYYY-MM-DD_intersectionID_PM1.csv
    ├── …
    └── YYYY-MM-DD_intersectionID_PM5.csv

    Here, AM1-AM5 and PM1-PM5 denote morning and afternoon flight sessions, respectively. For example, 2022-10-04_S_AM1.csv contains all extracted trajectories from the first morning session of the first day at the intersection 'S'.

    CSV File Example Structure:

    Each CSV file contains high-frequency trajectory data, formatted as follows (d.p. = decimal place):

    Dataset Column Name | Format / Units | Data Type | Explanation
    Vehicle_ID | 1, 2, … | Integer | Unique vehicle identifier within each CSV file
    Local_Time | hh:mm:ss.sss | String | Local Korean time (GMT+9) in ISO 8601 format
    Drone_ID | 1, 2, …, 10 | Integer | Unique identifier for the drone capturing the data
    Ortho_X, Ortho_Y | px (1 d.p.) | Float | Vehicle center coordinates in the orthophoto cut-out image
    Local_X, Local_Y | m (2 d.p.) | Float | KGD2002 / Central Belt 2010 planar coordinates (EPSG:5186)
    Latitude, Longitude | ° DD (7 d.p.) | Float | WGS84 geographic coordinates in decimal degrees (EPSG:4326)
    Vehicle_Length*, Vehicle_Width* | m (2 d.p.) | Float | Estimated physical dimensions of the vehicle
    Vehicle_Class | Categorical (0–3) | Integer | Vehicle type: 0 (car/van), 1 (bus), 2 (truck), 3 (motorcycle)
    Vehicle_Speed* | km/h (1 d.p.) | Float | Estimated speed computed from trajectory data using Gaussian smoothing
    Vehicle_Acceleration* | m/s² (2 d.p.) | Float | Estimated acceleration derived from smoothed speed values
    Road_Section* | N_G | String | Road section identifier (N = node, G = lane group)
    Lane_Number* | 1, 2, … | Integer | Lane position (1 = leftmost lane in the direction of travel)
    Visibility | 0/1 | Boolean | 1 = fully visible, 0 = partially visible in the camera frame

    • Columns marked with an asterisk (*) may be empty under certain conditions; see [1] for more details. A minimal loading sketch follows this note.
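    The sketch below shows one way such a session file could be loaded and a single vehicle's speed profile resampled; the file name follows the naming convention above, and the use of pandas is an assumption rather than part of the dataset.

     import pandas as pd

     # Sketch: load one flight-session CSV (name follows the convention above)
     # and look at a single vehicle. Column names are taken from the table above.
     df = pd.read_csv("2022-10-04_S_AM1.csv")

     # Local_Time is local Korean time in hh:mm:ss.sss; parse it for resampling.
     df["Local_Time"] = pd.to_datetime(df["Local_Time"], format="%H:%M:%S.%f")

     one_vehicle = df[df["Vehicle_ID"] == 1].set_index("Local_Time")

     # Downsample the ~29.97 Hz speed samples to 1-second means.
     print(one_vehicle["Vehicle_Speed"].resample("1s").mean().head())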
    2. Orthophoto Cut-Outs (orthophotos.zip, 1.8 GB)

    For each intersection, we provide the high-resolution orthophoto cut-outs that were used for georeferencing. These 8000×8000 pixel PNG images cover specific areas, allowing users to overlay orthophoto trajectories within the road network.

    orthophotos/
    │── A.png
    │── B.png
    │── …
    └── U.png

    For more details on the orthophoto generation process, refer to [1].

    3. Orthophoto Segmentations (segmentations.zip, 24.9 KB)

    We provide the road and lane segmentations for each orthophoto cut-out, stored as CSV files where each row defines a lane polygon within a road section.

    Each section (N_G) groups lanes moving in the same direction, with lanes numbered sequentially from the innermost outward. The CSV files are structured as follows:

    segmentations/
    │── A.csv
    │── B.csv
    │── …
    └── U.csv

    Each file contains the following columns:

    Section: Road section ID (N_G format).

    Lane: Lane number within the section.

    tlx, tly, blx, bly, brx, bry, trx, try: Polygon corner coordinates.

    These segmentations enabled trajectory points to be mapped to specific lanes and sections in our trajectory dataset. Vehicles outside segmented areas (e.g., intersection centers) remain unlabeled. Perspective distortions may also cause misalignments for taller vehicles.
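    As an illustration of that lane-mapping step, the sketch below tests whether an orthophoto-coordinate point falls inside one of the lane polygons; the point-in-polygon test via matplotlib is our own choice for the example and not necessarily the method used in [1].

     import pandas as pd
     from matplotlib.path import Path

     # Sketch: map an orthophoto point (x, y) to a lane using the segmentation
     # CSV for intersection 'A'. Column names follow the description above.
     lanes = pd.read_csv("A.csv")

     def find_lane(x, y):
         for _, row in lanes.iterrows():
             corners = [(row["tlx"], row["tly"]), (row["blx"], row["bly"]),
                        (row["brx"], row["bry"]), (row["trx"], row["try"])]
             if Path(corners).contains_point((x, y)):
                 return row["Section"], row["Lane"]
         return None  # e.g. intersection centers are not covered by any polygon

     print(find_lane(4805.0, 512.0))  # the coordinates here are placeholders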

    4. Sample Videos (sample_videos.zip, 26.8 GB)

    The dataset includes 29 video samples, each capturing the first 60 seconds of drone hovering over its designated intersection during the final session (PM5) on October 7, 2022. These high-resolution 4K videos provide additional context for trajectory analysis and visualization, complementing the orthophoto cut-outs and segmentations.

    sample_videos/
    │── A_D1_2022-10-07_PM5_60s.mp4
    │── A_D2_2022-10-07_PM5_60s.mp4
    │── B_D1_2022-10-07_PM5_60s.mp4
    │── …
    └── U_D10_2022-10-07_PM5_60s.mp4

    Additional Files

    README.md – Dataset documentation (this file)

    LICENSE.txt – Creative Commons Attribution 4.0 License

    Known Dataset Artifacts and Limitations

    While this dataset is designed for high accuracy, users should be aware of the following known artifacts and limitations:

    Trajectory Fragmentation: Trajectories may be fragmented for motorcycles in complex road infrastructure scenarios (pedestrian crossings, bicycle lanes, traffic signals) and for certain underrepresented truck variants. Additional fragmentations occurred when drones experienced technical issues during hovering, necessitating mid-recording splits that naturally resulted in divided trajectories.

    Vehicle ID Ambiguities: The largest Vehicle_ID in a CSV file does not necessarily indicate the total number of unique vehicles.

    Kinematic Estimation Limitations: Speed and acceleration values are derived from raw tracking data and may be affected by minor errors due to detection inaccuracies, stabilization artifacts, and applied interpolation and smoothing techniques.

    Vehicle Dimension Estimation: Estimates may be unreliable for stationary or non-axially moving vehicles and can be affected by bounding box overestimations capturing protruding vehicle parts or shadows.

    Lane and Section Assignment Inaccuracies: Perspective effects may cause vehicles with significant heights, such as trucks or buses, to be misassigned to incorrect lanes or sections in the orthophoto.

    Occasional pedestrian pair misclassifications: Rarely, two pedestrians walking side by side may be briefly mistaken for a motorcycle, but such instances are short-lived and typically removed by the short trajectory filter.

    For a comprehensive discussion of dataset limitations and validation procedures, refer to [1].

    Citation & Attribution

    Preferred Citation:

    If you use Songdo Traffic for any purpose, whether in academic research, commercial applications, open-source projects, or benchmarking efforts, please cite our accompanying manuscript [1]:

    Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, arXiv preprint arXiv:2411.02136.

  7. Data from: Change Point Detection in WLANs with Random AP Forests

    • figshare.com
    txt
    Updated Oct 9, 2023
    Cite
    Alexis Huet; Jonatan Krolikowski; Jose Manuel Navarro; Fuxing Chen; dario rossi (2023). Change Point Detection in WLANs with Random AP Forests [Dataset]. http://doi.org/10.6084/m9.figshare.23566146.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    figshare
    Authors
    Alexis Huet; Jonatan Krolikowski; Jose Manuel Navarro; Fuxing Chen; dario rossi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Artifacts from the ACM CoNEXT '23 paper "Change Point Detection in WLANs with Random AP Forests".

    Collection environment

    The collection campaign targeted a real 5GHz band WLAN and lasted over two months. The WLAN is a production network with thousands of daily users, located on a single floor of a large building equipped with 33 APs. The telemetry consists of a single AP-level Key Performance Indicator (KPI), collected every minute. In particular, the KPI in the dataset is the path loss measurement among pairs of APs and is aggregated in time (at a 15 minute granularity, corresponding to up to 96 samples per day) and space (for each undirected AP pair, a single average value of the path loss is reported). This information is contained in the rapf_kpis.csv file. A portion of the dataset is further annotated with ground truth concerning 9 relevant events that alter the signals, providing a benchmark to assess the validity of the change point detection. This information is contained in the rapf_labels.csv file.

    Artifacts

    The dataset comprises two csv files: rapf_kpis.csv (the path loss KPI time series) and rapf_labels.csv (the ground truth events). For both files, rows correspond to time, while the 529 columns correspond to: the timestamp (column 1), and the 528 distinct undirected pairs between the 33 APs, expressed with the format APx--APy (with x < y, as undirected KPIs are reported, for the remaining columns).

    Time series KPI data

    Each measured path loss value is collected for a certain AP pair over a 15 minute timeframe. During a period of two months (from Dec 02, 2021 to Feb 09, 2022), 2.2 million samples were collected among the 528 AP pairs. Missing values are introduced either by technical problems in the data collection (for 2 days among the 70 collection days), or by the inability of APs to collect path loss measurements of distant neighbors (roughly 35% of samples are missing in the dataset). The path loss values are expressed in dB and range from 43 to 119 (median and mean of 78). Missing values are expressed as NA. In more detail:

    • Collection from 2021-12-02T00:00:00Z to 2022-02-09T23:45:00Z (70 days) following ISO-8601 date representation. In particular, all the timestamps are UTC.
    • Among those 70 days, there are technical problems for two days (2021-12-23 and 2022-02-02) which are not present in the csv.
    • There are 96 timeslots in a day, so the number of rows (without the header) is (70-2)x96 = 6528.
    • Among those 6528 rows, there is one timestamp containing NA for all pairs (2021-12-18T07:30:00Z), while all others have at least one pair that is non-missing.

    Ground truth labels

    Each ground truth event contains information regarding the date of occurrence (timestamp) and the impacted pairs. As ground truth events are scarce, the timestamps that are not present in rapf_labels.csv do not contain any event for any AP pair. Overall, rapf_labels.csv contains 9 rows (one per event) and 529 columns (one per AP pair, plus the timestamp). A cell has either the value 0 (no change point for this pair at that timestamp) or 1 (a change point occurred at that timestamp for this pair). We provide a succinct description of the nine events as follows:

    • Event 1: Dec 06, 2021 at 17:45; all 33 APs impacted.
    • Event 2: Dec 07, 2021 at 16:00; 29 APs impacted (AP15, AP16, AP19, AP24 are not impacted, corresponding to 6 non-impacted AP pairs AP15--AP16, AP15--AP19, AP15--AP24, AP16--AP19, AP16--AP24, AP19--AP24).
    • Event 3: Dec 09, 2021 at 17:30; 1 AP impacted (AP24 is impacted, corresponding to 32 impacted AP pairs).
    • Event 4: Dec 14, 2021 at 16:45; 29 APs impacted (AP10, AP17, AP19, AP28 are not impacted, corresponding to 6 non-impacted AP pairs AP10--AP17, AP10--AP19, AP10--AP28, AP17--AP19, AP17--AP28, AP19--AP28).
    • Event 5: Dec 21, 2021 at 21:30; 29 APs impacted (AP6, AP11, AP19, AP20 are not impacted, corresponding to 6 non-impacted AP pairs AP6--AP11, AP6--AP19, AP6--AP20, AP11--AP19, AP11--AP20, AP19--AP20).
    • Event 6: Jan 04, 2022 at 16:45; all 33 APs impacted.
    • Event 7: Jan 11, 2022 at 18:15; all 33 APs impacted.
    • Event 8: Jan 18, 2022 at 16:30; all 33 APs impacted.
    • Event 9: Feb 04, 2022 at 16:30; all 33 APs impacted.
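    A minimal loading sketch is shown below; it assumes pandas and relies only on the column layout described above (a timestamp column followed by 528 APx--APy pair columns).

     import pandas as pd

     # Sketch: load the KPI time series and the ground-truth events. Both files
     # share the same layout: one timestamp column plus 528 "APx--APy" columns.
     kpis = pd.read_csv("rapf_kpis.csv", index_col=0, parse_dates=[0])
     labels = pd.read_csv("rapf_labels.csv", index_col=0, parse_dates=[0])

     print(kpis.shape)    # expected 6528 rows x 528 pair columns per the description
     print(labels.shape)  # expected 9 rows, one per ground-truth event

     # AP pairs affected by the first ground-truth event:
     first_event = labels.iloc[0]
     print(first_event[first_event == 1].index.tolist()[:5])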

  8. TBX11K Simplified - TB X-rays with bounding boxes

    • kaggle.com
    Updated Feb 8, 2023
    Cite
    vbookshelf (2023). TBX11K Simplified - TB X-rays with bounding boxes [Dataset]. https://www.kaggle.com/datasets/vbookshelf/tbx11k-simplified/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle
    Authors
    vbookshelf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The TBX11K dataset is a large dataset containing 11000 chest x-ray images. It's the only TB dataset that I know of that includes TB bounding boxes. This allows both classification and detection models to be trained.

    However, it can be mentally tiring to get started with this dataset. It includes many xml, json and txt files that you need to sift through to try to understand what everything means, how it all fits together and how to extract the bounding box coordinates.

    Here I've simplified the dataset. Now there's just one csv file, one folder containing the training images and one folder containing the test images.


    Paper: Rethinking Computer-aided Tuberculosis Diagnosis

    Original TBX11K dataset on Kaggle


    Notes

    1- Please start by reading the paper. It will help you understand what everything means.
    2- The original dataset was split into train and validation sets. This split is shown in the 'source' column in the data.csv file.
    3- The test images are stored in the folder called "test". There are no labels for these images and I've not included them in data.csv.
    4- Each bounding box is on a separate row. Therefore, the file names in the "fname" column are not unique. For example, if an image has two bounding boxes then the file name for that image will appear twice in the "fname" column.
    5- The original dataset has a folder named "extra" that contains data from other TB datasets. I've not included that folder here.
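    Because each bounding box sits on its own row, grouping data.csv by file name is a natural first step; a minimal pandas sketch is shown below (only the "fname" and "source" columns are named in the notes above, so any other column names would need to be checked against the file itself).

     import pandas as pd

     # Sketch: data.csv has one row per bounding box, so "fname" is not unique.
     # Group rows by image to recover all boxes that belong to each file.
     df = pd.read_csv("data.csv")

     boxes_per_image = df.groupby("fname").size()
     print(boxes_per_image.describe())  # how many boxes images typically have

     # All rows (i.e. all boxes) for one example image:
     print(df[df["fname"] == df["fname"].iloc[0]])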


    Acknowledgements

    Many thanks to the team that created the TBX11K dataset and generously made it publicly available.


    Citation

     # TBX11K dataset
     @inproceedings{liu2020rethinking,
      title={Rethinking computer-aided tuberculosis diagnosis},
      author={Liu, Yun and Wu, Yu-Huan and Ban, Yunfeng and Wang, Huifang and Cheng, Ming-Ming},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={2646--2655},
      year={2020}
     }
    
    


    Helpful Resources

  9. Gait Database

    • figshare.com
    zip
    Updated Jul 21, 2022
    Cite
    Nazli Rafei Dehkordi; saman farahmand (2022). Gait Database [Dataset]. http://doi.org/10.6084/m9.figshare.20346852.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 21, 2022
    Dataset provided by
    figshare
    Authors
    Nazli Rafei Dehkordi; saman farahmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gait recognition is the characterization of unique biometric patterns associated with each individual, which can be utilized to identify a person without direct contact. A public gait database with a relatively large number of subjects can provide a great opportunity for future studies to build and validate gait authentication models. The goal of this study is to introduce a comprehensive gait database of 93 human subjects who walked between two end points (320 meters) during two different sessions, with their gait data recorded using two smartphones, one attached to the right thigh and the other to the left side of the waist. This data was collected with the intention of being utilized by deep learning-based methods, which require enough time points. The metadata, including age, gender, smoking, daily exercise time, height, and weight of each individual, is recorded. This dataset is publicly available.

    Except for 19 subjects who did not attend the second session, every subject is associated with 4 different log files (each session contains two log files). Every file name has one of the following patterns:

    · sub0-lw-s1.csv: subject number 0, left waist, session 1
    · sub0-rp-s1.csv: subject number 0, right thigh, session 1
    · sub0-lw-s2.csv: subject number 0, left waist, session 2
    · sub0-rp-s2.csv: subject number 0, right thigh, session 2

    Every log file contains 58 features that are internally captured and calculated using the SensorLog app. Additionally, an Excel file containing the metadata is provided for each subject.
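    A small sketch for iterating over the per-subject log files is given below; the glob pattern follows the naming convention above, and pandas is assumed for reading the CSV logs.

     import glob
     import re
     import pandas as pd

     # Sketch: walk the log files named sub<N>-<lw|rp>-s<1|2>.csv as described
     # above and print basic shape information per recording.
     pattern = re.compile(r"sub(\d+)-(lw|rp)-s([12])\.csv")

     for path in sorted(glob.glob("sub*-*-s*.csv")):
         match = pattern.search(path)
         if match is None:
             continue
         subject, placement, session = match.groups()
         log = pd.read_csv(path)  # 58 SensorLog features per the description
         print(subject, placement, session, log.shape)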

  10. HiRISE Image Patches Obscured by Atmospheric Dust

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
Gary Doran (2020). HiRISE Image Patches Obscured by Atmospheric Dust [Dataset]. http://doi.org/10.5281/zenodo.3495068
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gary Doran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The purpose of this dataset is to train a classifier to detect "dusty" versus "not dusty" patches within browse-resolution HiRISE observations of the Martian surface. Here, "dusty" refers to images in which the view of the surface has been obscured heavily by atmospheric dust.

    The dataset contains two sets of 20,000 image patches each from EDR (full resolution) and RDR ("browse" resolution) non-map-projected ("nomap") HiRISE images, with balanced classes. The patches have been split into train (n = 10,000), validation (n = 5,000), and test (n = 5,000) sets such that no two patches from the same HiRISE observation appear in more than one of these subsets. There could be some noise in the labels, but a subset of the validation images have been manually vetted so that label noise rates can be estimated. More details on the dataset creation process are described below.

    Generating Candidate Images and Patches

    To begin constructing the dataset, the paper "The origin, evolution, and trajectory of large dust storms on Mars during Mars years 24–30 (1999–2011)," by Wang and Richardson (2015), was used to compile a set of time ranges for which global or regional dust storms were known to be occurring on Mars. All HiRISE RDR nomap browse images acquired within these time ranges were then inspected manually to determine sets of images that were (1) almost entirely obscured by dust and (2) almost entirely clear of dust. Then, 10,000 patches from the two subsets of images were extracted to form the "dusty" and "not dusty" classes. The extracted patches are 100-by-100 pixels, which roughly corresponds to the width of one CCD channel within the browse image (the width of the raw EDR data products that are stitched together to form a full RDR image). Some small amount of label noise is introduced in this process, since a patch from a mostly dusty image might happen to contain a clear view of the ground, and a patch from a mostly non-dusty image might contain some dust or regions on the surface that are featureless and appear like dusty patches. A set of "vetting labels" is included, which includes human annotations by the author for a subset of the validation set of patches. These labels can be used to estimate the apparent label noise in the dataset.

    Corresponding to the RDR patch dataset, a set of patches are extracted from the same set of EDR images for the "dusty" and "not dusty" classes. EDRs are raw images from the instrument that have not been calibrated or stitched together. To provide some form of normalization, EDR patches are only extracted from the lower half of the EDRs, with the upper half being used to perform a basic calibration of the lower half. Basic calibration is done by subtracting the sample (image column) averages from the upper half to remove "striping," then computing the 0.1th and 99.9th percentiles of the remaining values in the upper half and stretching the image patch to 8-bit integer values [0, 255] within that range. The calibration is meant to implement a process that could be performed onboard the spacecraft as the data is being observed (hence, using the top half of the image acquired first to calibrate the lower half of the image which is acquired later). The full resolution EDRs, which are 1024 pixels wide, are resized down to 100-by-100 pixel patches after being extracted so that they roughly match the resolution of the patches from the RDR browse images.
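    A rough numpy rendering of that calibration is sketched below; it is our reading of the description, not the author's code, and the array shape is illustrative.

     import numpy as np

     # Sketch of the EDR calibration as described above: use the upper half of
     # the image to de-stripe and stretch the lower half to 8-bit values.
     def calibrate_lower_half(edr):
         """edr: 2-D array of raw counts (rows x 1024 columns)."""
         upper, lower = np.array_split(edr, 2, axis=0)

         # Remove "striping" with per-column means estimated on the upper half.
         column_means = upper.mean(axis=0)
         upper = upper - column_means
         lower = lower - column_means

         # Stretch using the 0.1th / 99.9th percentiles of the upper half.
         lo, hi = np.percentile(upper, [0.1, 99.9])
         scaled = np.clip((lower - lo) / (hi - lo), 0.0, 1.0)
         return (scaled * 255).astype(np.uint8)

     demo = calibrate_lower_half(np.random.default_rng(0).normal(size=(2048, 1024)))
     print(demo.shape, demo.dtype)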

    Archive Contents

    The compressed archive file contains two top-level directories with similar contents, "edr_nomap_full_resized" and "rdr_nomap_browse." The first directory contains the dataset constructed from EDR data and the second contains the dataset constructed from RDR data.

    Within each directory, there are "dusty" and "not_dusty" directories containing the image patches from each class, "manifest.csv," and "vetting_labels.csv." The vetting labels file contains a list of manually labeled examples, along with the original labels to make it easier to compute label noise rates. The "manifest.csv" file contains a list of every example, its label, and whether it belongs to the train, validation, or test set.

    An example ID encodes information about where the patch was sampled from the original HiRISE image. As an example from the RDR dataset, the ID "003100_PSP_004440_2125_r4805_c512" can be broken into several parts:

    • "003100" is a unique numerical ID
    • "PSP_004440_2125" is the HiRISE observation ID
    • "r4805_c512" means the patch's upper left corner starts at the 4805th row and 512th column of the original observation

    For the EDR dataset, the ID "200000_PSP_004530_1030_RED7_1_r9153" is broken down as follows (a parsing sketch follows the list):

    • "200000" is a unique numerical ID
    • "PSP_004530_1030" is the HiRISE observation ID
    • "RED7" is the CCD ID
    • "1" is the CCD channel (either 0 or 1)
    • "r9153" means that the patch is extracted starting at the 9153rd row (since all columns of the 1024-pixel EDR are used, no column is specified; it is implicitly always 0)

    Original Data

    The original HiRISE EDR and RDR data is available via the Planetary Data System (PDS), hosted at https://hirise-pds.lpl.arizona.edu/PDS/

  11. THÖR-MAGNI: A Large-scale Indoor Motion Capture Recording of Human Movement...

    • zenodo.org
    zip
    Updated Feb 7, 2024
    Cite
    Tim Schreiter; Tiago Rodrigues de Almeida; Yufei Zhu; Eduardo Gutierrez Maestro; Andrey Rudenko (2024). THÖR-MAGNI: A Large-scale Indoor Motion Capture Recording of Human Movement and Interaction [Dataset]. http://doi.org/10.5281/zenodo.10554472
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Schreiter; Tiago Rodrigues de Almeida; Yufei Zhu; Eduardo Gutierrez Maestro; Andrey Rudenko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The THÖR-MAGNI Dataset Tutorials

    THÖR-MAGNI is a novel dataset of accurate human and robot navigation and interaction in diverse indoor contexts, building on the previous THÖR dataset protocol. We provide position and head orientation motion capture data, 3D LiDAR scans and gaze tracking. In total, THÖR-MAGNI captures 3.5 hours of motion of 40 participants on 5 recording days.

    This data collection is designed around systematic variation of factors in the environment to allow building cue-conditioned models of human motion and verifying hypotheses on factor impact. To that end, THÖR-MAGNI encompasses 5 scenarios, some of which have different conditions (i.e., we vary some factor):

    • Scenario 1 (plus conditions A and B):
      • Participants move in groups and individually;
      • Robot as static obstacle;
      • Environment with 3 obstacles and lane marking on the floor for condition B;
    • Scenario 2:
      • Participants move in groups, individually and transport objects with variable difficulty (i.e. bucket, boxes and a poster stand);
      • Robot as static obstacle;
      • Environment with 3 obstacles;
    • Scenario 3 (plus conditions A and B):
      • Participants move in groups, individually and transporting objects with variable difficulty (i.e. bucket, boxes and a poster stand). We denote each role as: Visitors-Alone, Visitors-Group 2, Visitors-Group 3, Carrier-Bucket, Carrier-Box, Carrier-Large Object;
      • Teleoperated robot as moving agent: in condition A, the robot moves with differential drive; in condition B, the robot moves with omni-directional drive;
      • Environment with 2 obstacles;
    • Scenario 4 (plus conditions A and B):
      • All participants, denoted as Visitors-Alone HRI, interacted with the teleoperated mobile robot;
      • Robot interacted in two ways: in condition A (Verbal-Only), the Anthropomorphic Robot Mock Driver (ARMoD), a small humanoid NAO robot on top of the mobile platform, only used speech to communicate the next goal point to the participant; in condition B the ARMoD used speech, gestures and robotic gaze to convey the same message;
      • Free space environment
    • Scenario 5:
      • Participants move alone (Visitors-Alone) and one of the participants, denoted as Visitors-Alone HRI, transports objects and interacts with the robot;
      • The ARMoD is remotely controlled by an experimenter and proactively offers help;
      • Free space environment;

    Preliminary steps

    Before proceeding, make sure to download the data from ZENODO

    1. Directory Structure

    ├── docs

    │ ├── tutorials.md <- Tutorials document on how to use the data

    ├── goals_positions.csv <- File with the goals locations

    ├── maps <- Directory for maps of the environment (PNG files) and offsets (json file)

    │ ├── offsets.json <- Offsets of the map with respect to the global coordinate frame origin

    │ ├── {date}_SC{sc_id}_map.png <- Maps for `date` in {1205, 1305, 1705, 1805} and `sc_id` in {1A, 1B, 2, 3}

    │ ├── 3009_map.png <- Map for the Scenarios 4A, 4B and 5

    ├── CSVs_Scenarios <- Directory for aligned data for all scenarios

    │ ├── Scenario_1 <- Directory for the CSV files for Scenario 1

    │ ├── Scenario_2 <- Directory for the CSV files for Scenario 2

    │ ├── Scenario_3 <- Directory for the CSV files for Scenario 3

    │ ├── Scenario_4 <- Directory for the CSV files for Scenario 4

    │ ├── Scenario_5 <- Directory for the CSV files for Scenario 5

    ├── TSVs_RAWET <- Directory for the TSV files for the Raw Eyetracking data for all Scenarios

    │ ├── synch_info.csv <- Event markers necessary to align motion capture with eyetracking data

    │ ├── Files <- Directory with all the raw eyetracking TSV files

    2. Data Structure and Dataset Files

    Within each Scenario directory, each csv file contains:

    2.1. Headers

    The dataset metadata overview contains important information found in the CSV file headers. This reference is designed to help users understand and use the dataset effectively. The headers include details such as FILE_ID, which provides information on the date, scenario, condition, and run associated with each recording. The header of the document includes important quantities such as the number of frames recorded (N_FRAMES_QTM), the count of rigid bodies (N_BODIES), and the total number of markers (N_MARKERS).

    It also provides information about the order of the contiguous rotation matrix (CONTIGUOUS_ROTATION_MATRIX), modalities measured with units, and specified measurement units. The text presents details on the eyetracking devices used in each recording, including their infrared sensor and scene camera frequencies, as well as an indication of the presence of eyetracking data.

    The header provides specific information about rigid bodies, including their names (BODY_NAMES), role labels (BODY_ROLES), and the number of markers associated with each rigid body (BODY_NR_MARKERS). Finally, the table lists all marker names used in the file.

    This metadata provides researchers and practitioners with essential guidance on recording information, data quantities, and specifics about rigid bodies and markers. It is a valuable resource for understanding and effectively using the dataset in the CSV files.

    2.2. Trajectory Data

    The remaining portion of the CSV file integrates merged data from the motion capture system and eye tracking devices, organized based on participants' helmet rigid bodies. Columns within the dataset include XYZ coordinates of all markers, spatial centroid coordinates, 6DOF orientation of the object's local coordinate frame, and if available eye tracking data, encompassing 2D/3D gaze coordinates, scene recording frame numbers, eye movement types, and IMU data.

    Missing data is denoted by "N/A" or an empty cell. Temporal indexing is facilitated by the "Time" or "Frame" column, indicating timestamps or frame numbers. The motion capture system records at 100Hz, Tobii Glasses at 50Hz (Raw); 25 Hz (Camera), and Pupil Glasses at 100Hz (Raw); 30 Hz (Camera). The dataset is structured around motion capture recordings, and for each rigid body, such as "Helmet_1," details per frame include XYZ coordinates of markers, centroid coordinates, and a 9-element rotational matrix describing helmet orientation.

    Header | Explanation
    Helmet_1 - 1 X | X-Coordinate of Marker Number 1
    Helmet_1 - 1 Y | Y-Coordinate of Marker Number 1
    Helmet_1 - 1 Z | Z-Coordinate of Marker Number 1
    Helmet_1 - [...] | Same for Marker 2 and 3 of Helmet_1
    Helmet_1 Centroid_X | X-Coordinate of the Centroid
    Helmet_1 Centroid_Y | Y-Coordinate of the Centroid
    Helmet_1 Centroid_Z | Z-Coordinate of the Centroid
    Helmet_1 R0 | 1st Element of the CONTIGUOUS_ROTATION_MATRIX
    Helmet_1 R[..] | Same for R1 - R7
    Helmet_1 R8 | 9th Element of the CONTIGUOUS_ROTATION_MATRIX
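    A cautious loading sketch is given below; the file name is a placeholder built from the directory structure above, the column names follow the table above, and the real QTM exports carry a metadata header block whose exact layout should be checked against the file itself.

     import pandas as pd

     # Sketch: pull one helmet's centroid trajectory out of a scenario CSV.
     # The path is a placeholder; header rows may need to be skipped depending
     # on how the metadata block described above is stored in the file.
     csv_path = "CSVs_Scenarios/Scenario_1/example_run.csv"
     df = pd.read_csv(csv_path, na_values=["N/A"])

     centroid_cols = ["Helmet_1 Centroid_X", "Helmet_1 Centroid_Y", "Helmet_1 Centroid_Z"]
     trajectory = df[centroid_cols].dropna()

     print(trajectory.head())  # XYZ centroid positions, recorded at 100 Hz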

    2.3. Eyetracking Data

    The eye tracking data in the dataset includes 16 participants, providing a comprehensive dataset of over 500 minutes of recorded data across the different activities and scenarios with three different eyetracking devices. Devices are denoted with a special "Tracker_ID" in the dataset, i.e.:

    Tracker ID | Eyetracking Device
    TB2 | Tobii 2 Glasses
    TB3 | Tobii 3 Glasses
    PPL | Pupil Invisible Glasses

    Gaze points are classified into fixations and saccades using the Tobii I-VT Attention filter, which is specifically optimized for dynamic scenarios, with a velocity threshold of 100°/s. Eyetracking device calibrations were systematically repeated after each 4-minute recording to account for natural variations in participants' eye shapes and to improve the gaze estimation. In addition, gaze estimation adjustments for the Pupil Invisible Glasses were made after each 4-minute recording to mitigate potential drifts. It's worth noting that the scene cameras of the eye tracking glasses had different fields of view. The scene camera of the Pupil Invisible Glasses had a 1088x1080 image with both horizontal (HFOV) and vertical (VFOV) opening angles of 80°, while the Tobii Glasses provided a 1920x1080 image with different opening angles for Tobii Glasses 3 (HFOV: 95°, VFOV: 63°) and Tobii Glasses 2 (HFOV: 82°, VFOV: 52°).

    NOTE: Videos are not part of the dataset, they will be made available in 2024

    For one participant, wearing the Tobii Glasses 3 and Helmet_6, the data would be denoted as:

    Header | Explanation
    Helmet_6 -

