11 datasets found
  1. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system, run tar xf code.tar). Navigate to code/paper_source. Install R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS function, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

    Generating datasets. Building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
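    For readers who prefer to inspect the intermediate datasets outside of R, a minimal sketch using pandas is shown below; the file name newcomers.tab simply mirrors the newcomers.RDS example above and is only illustrative, since the exact .tab exports and their columns are not listed in this description.

     # Minimal sketch: load one of the tab-separated intermediate files with pandas.
     # "newcomers.tab" is illustrative; substitute whichever .tab file you downloaded.
     import pandas as pd

     newcomers = pd.read_csv("newcomers.tab", sep="\t")

     # Quick sanity checks on whatever columns the export actually contains.
     print(newcomers.shape)
     print(newcomers.dtypes)
     print(newcomers.head())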

  2. riiid_train_converted to Multiple Formats

    • kaggle.com
    Updated Jun 2, 2021
    Cite
    Santh Raul (2021). riiid_train_converted to Multiple Formats [Dataset]. https://www.kaggle.com/santhraul/riiid-train-converted-to-multiple-formats/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Santh Raul
    Description

    Context

    The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into the Kaggle Notebook's RAM when loaded with the default pandas read_csv, which prompted a search for alternative approaches and formats.

    Content

    Train data of Riiid competition in different formats.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Reading the .CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .CSV into different file formats so that they can be loaded easily into a Kaggle kernel.
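    As an illustration of the conversion described above, the sketch below reads the CSV with reduced-memory dtypes and writes it back out in a columnar format; the column names and dtypes shown are placeholders rather than the actual Riiid schema, and pyarrow is assumed to be installed for Parquet support.

     import pandas as pd

     # Sketch only: read a large CSV with explicit dtypes to keep memory down,
     # then save it once in a columnar format so later loads are fast.
     # Column names/dtypes are placeholders, not the actual Riiid schema.
     dtypes = {"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"}
     train = pd.read_csv("train.csv", dtype=dtypes, usecols=list(dtypes))

     train.to_parquet("riiid_train.parquet")         # requires pyarrow
     train = pd.read_parquet("riiid_train.parquet")  # fast re-load in later sessions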

  3. OES_RSI Dataset

    • figshare.com
    docx
    Updated Aug 3, 2020
    Cite
    Tricia Salzar; Mark Benden (2020). OES_RSI Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12753545.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Aug 3, 2020
    Dataset provided by
    figshare
    Authors
    Tricia Salzar; Mark Benden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data was collected from a large company as part of a corporate wellness program. The data were provided by Cority Enviance and collected via their data logging software RSIGuard and Remedy's Interactive Workplace Injury Prevention Program. The data is located in the CSV file and contains measures of discomfort and computer utilization for the three years of interest. Data cover 2012-2015 and include three computer use records per participant (28, 91 and 364 days) and two discomfort and workstation measurements per participant. The data is de-identified (includes a study ID) and contains no demographics for participants. Variable descriptions are included in the separate Word file.

  4. Data from: Data files used to study the distribution of growth in software...

    • researchdata.edu.au
    Updated May 4, 2011
    Cite
    Swinburne University of Technology (2011). Data files used to study the distribution of growth in software systems [Dataset]. https://researchdata.edu.au/files-used-study-software-systems/14865
    Explore at:
    Dataset updated
    May 4, 2011
    Dataset provided by
    Swinburne University of Technology
    Description

    The evolution of a software system can be studied in terms of how various properties as reflected by software metrics change over time. Current models of software evolution have allowed for inferences to be drawn about certain attributes of the software system, for instance, regarding the architecture, complexity and its impact on the development effort. However, an inherent limitation of these models is that they do not provide any direct insight into where growth takes place. In particular, we cannot assess the impact of evolution on the underlying distribution of size and complexity among the various classes. Such an analysis is needed in order to answer questions such as 'do developers tend to evenly distribute complexity as systems get bigger?', and 'do large and complex classes get bigger over time?'. These are questions of more than passing interest since by understanding what typical and successful software evolution looks like, we can identify anomalous situations and take action earlier than might otherwise be possible. Information gained from an analysis of the distribution of growth will also show if there are consistent boundaries within which a software design structure exists. In our study of metric distributions, we focused on 10 different measures that span a range of size and complexity measures. The raw metric data (4 .txt files and 1 .log file in a .zip file measuring ~0.5MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).

  5. AV : Healthcare Analytics

    • kaggle.com
    zip
    Updated Sep 13, 2020
    Cite
    shivan kumar (2020). AV : Healthcare Analytics [Dataset]. https://www.kaggle.com/shivan118/healthcare-analytics
    Explore at:
    Available download formats: zip (1591838 bytes)
    Dataset updated
    Sep 13, 2020
    Authors
    shivan kumar
    Description

    Context

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides them the facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years and they see a high drop off between “Registration” and the Number of people taking tests at the Camps. In the last 4 years, they have stored data of ~110,000 registrations they have done.

    One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than the required inventory, you incur unnecessarily high costs. On the other hand, if you carry less than the required inventory for conducting these medical checks, people end up having a bad experience.

    The Process:

    1. MedCamp employees/volunteers reach out to people and drive registrations.
    2. During the camp, People who “ShowUp” either undergo the medical tests or visit stalls depending on the format of the health camp.

    Other things to note:

    • Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    • For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
    • MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

    Favorable outcome:

    • For the first 2 formats, a favorable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    • You need to predict the chances (probability) of having a favorable outcome.

    Data Description

    Train.zip contains the following 6 csv files, alongside the data dictionary that contains definitions for each variable:

    Health_Camp_Detail.csv – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

    Train.csv – File containing registration details for all the train camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

    Patient_Profile.csv – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

    First_Health_Camp_Attended.csv – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.

    Second_Health_Camp_Attended.csv - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.

    Third_Health_Camp_Attended.csv - This file contains details about people who attended health camp of third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.

    Test Set

    Test.csv – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

    Train / Test split:

    Camps that started on or before 31st March 2006 are considered in Train. Test data is for all camps conducted on or after 1st April 2006.

    Sample Submission:

    Patient_ID: Unique Identifier for each patient. This ID is not sequential in nature and can not be used in modeling

    Health_Camp_ID: Unique Identifier for each camp. This ID is not sequential in nature and can not be used in modeling

    Outcome: Predicted probability of a favorable outcome.

    Evaluation Metric

    The evaluation metric for this hackathon is ROC-AUC Score.
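    A hedged sketch of how the favorable-outcome label could be assembled from the files above is shown below; the file and column names follow the description, but the merge logic is an assumption rather than an official solution, and sklearn is used only to illustrate the ROC-AUC metric.

     import pandas as pd
     from sklearn.metrics import roc_auc_score

     # Sketch only: build a binary "favorable outcome" label per registration,
     # following the definition above. File and column names come from the
     # description; the merge details are an assumption.
     train = pd.read_csv("Train.csv")
     first = pd.read_csv("First_Health_Camp_Attended.csv")    # has Health_Score
     second = pd.read_csv("Second_Health_Camp_Attended.csv")  # has Health_Score
     third = pd.read_csv("Third_Health_Camp_Attended.csv")    # has Number_of_stall_visited

     keys = ["Patient_ID", "Health_Camp_ID"]
     favorable = pd.concat([
         first[keys],                                             # got a health score
         second[keys],                                            # got a health score
         third.loc[third["Number_of_stall_visited"] >= 1, keys],  # visited at least one stall
     ]).drop_duplicates()
     favorable["Outcome"] = 1

     train = train.merge(favorable, on=keys, how="left")
     train["Outcome"] = train["Outcome"].fillna(0).astype(int)

     # A constant-probability baseline scores 0.5 on the hackathon metric (ROC-AUC):
     baseline = [train["Outcome"].mean()] * len(train)
     print(roc_auc_score(train["Outcome"], baseline))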

  6. Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a...

    • data.niaid.nih.gov
    Updated Mar 17, 2025
    Cite
    Geroliminis, Nikolas (2025). Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a Large-Scale Study in a Smart City [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13828383
    Explore at:
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Fonod, Robert
    Geroliminis, Nikolas
    Yeo, Hwasoo
    Cho, Haechan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Songdo-dong
    Description

    Overview

    The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured through high-altitude bird's-eye view (BeV) drone footage over Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, this resource represents one of the most extensive aerial traffic datasets publicly available, distinguishing itself through exceptional temporal resolution that captures vehicle movements at 29.97 points per second, enabling unprecedented granularity for advanced urban mobility analysis.

    ⚠️ Important: If you use this dataset in your work, please cite the following reference [1]:

    Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, arXiv preprint arXiv:2411.02136.

    (Note: This manuscript shall be replaced by the published version once available.)

    Dataset Composition

    The dataset consists of four primary components:

    Trajectory Data: 80 ZIP archives containing high-resolution vehicle trajectories with georeferenced positions, speeds and acceleration profiles, and other metadata.

    Orthophoto Cut-Outs: High-resolution (8000×8000 pixel) orthophoto images for each monitored intersection, used for georeferencing and visualization.

    Road and Lane Segmentations: CSV files defining lane polygons within road sections, facilitating mapping of vehicle positions to road segments and lanes.

    Sample Videos: A selection of 4K UHD drone video samples capturing intersection footage during the experiment.

    Data Collection

    The dataset was collected as part of a collaborative multi-drone experiment conducted by KAIST and EPFL in Songdo, South Korea, from October 4–7, 2022.

    A fleet of 10 drones monitored 20 busy intersections, executing advanced flight plans to optimize coverage.

    4K (3840×2160) RGB video footage was recorded at 29.97 FPS from altitudes of 140–150 meters.

    Each drone flew 10 sessions per day, covering peak morning and afternoon periods.

    The experiment resulted in 12TB of 4K raw video data.

    More details on the experimental setup and data processing pipeline are available in [1].

    Data Processing

    The trajectories were extracted using geo-trax, an advanced deep learning framework designed for high-altitude UAV-based traffic monitoring. This state-of-the-art pipeline integrates vehicle detection, tracking, trajectory stabilization, and georeferencing to extract high-accuracy traffic data from drone footage.

    Key Processing Steps:

    Vehicle Detection & Tracking: Vehicles were detected and tracked across frames using a deep learning-based detector and motion-model-based tracking algorithm.

    Trajectory Stabilization: A novel track stabilization method was applied using detected vehicle bounding boxes as exclusion masks in image registration.

    Georeferencing & Coordinate Transformation: Each trajectory was transformed into global (WGS84), local Cartesian, and orthophoto coordinate systems.

    Vehicle Metadata Estimation: In addition to time-stamped vehicle trajectories, various metadata attributes were also extracted, including vehicle dimensions and type, speed, acceleration, class, lane number, road section, and visibility status.

    More details on the extraction methodology are available in [1].

    File Structure & Formats

    1. Trajectory Data (Daily Intersection ZIPs, 16.2 MB ~ 360.2 MB)

    The trajectory data is organized into 80 ZIP files, each containing traffic data for a specific intersection and day of the experiment.

    File Naming Convention:

    YYYY-MM-DD_intersectionID.zip

    YYYY-MM-DD represents the date of data collection (2022-10-04 to 2022-10-07).

    intersectionID is a unique identifier for one of the 20 intersections where data was collected (A, B, C, E, …, U). The letter D is reserved to denote "Drone".

    Each ZIP file contains 10 CSV files, each corresponding to an individual flight session:

    YYYY-MM-DD_intersectionID.zip
    │── YYYY-MM-DD_intersectionID_AM1.csv
    ├── …
    │── YYYY-MM-DD_intersectionID_AM5.csv
    │── YYYY-MM-DD_intersectionID_PM1.csv
    ├── …
    └── YYYY-MM-DD_intersectionID_PM5.csv

    Here, AM1-AM5 and PM1-PM5 denote morning and afternoon flight sessions, respectively. For example, 2022-10-04_S_AM1.csv contains all extracted trajectories from the first morning session of the first day at the intersection 'S'.

    CSV File Example Structure:

    Each CSV file contains high-frequency trajectory data, formatted as follows (d.p. = decimal place):

    Dataset Column Name | Format / Units | Data Type | Explanation
    Vehicle_ID | 1, 2, … | Integer | Unique vehicle identifier within each CSV file
    Local_Time | hh:mm:ss.sss | String | Local Korean time (GMT+9) in ISO 8601 format
    Drone_ID | 1, 2, …, 10 | Integer | Unique identifier for the drone capturing the data
    Ortho_X, Ortho_Y | px (1 d.p.) | Float | Vehicle center coordinates in the orthophoto cut-out image
    Local_X, Local_Y | m (2 d.p.) | Float | KGD2002 / Central Belt 2010 planar coordinates (EPSG:5186)
    Latitude, Longitude | ° DD (7 d.p.) | Float | WGS84 geographic coordinates in decimal degrees (EPSG:4326)
    Vehicle_Length*, Vehicle_Width* | m (2 d.p.) | Float | Estimated physical dimensions of the vehicle
    Vehicle_Class | Categorical (0–3) | Integer | Vehicle type: 0 (car/van), 1 (bus), 2 (truck), 3 (motorcycle)
    Vehicle_Speed* | km/h (1 d.p.) | Float | Estimated speed computed from trajectory data using Gaussian smoothing
    Vehicle_Acceleration* | m/s² (2 d.p.) | Float | Estimated acceleration derived from smoothed speed values
    Road_Section* | N_G | String | Road section identifier (N = node, G = lane group)
    Lane_Number* | 1, 2, … | Integer | Lane position (1 = leftmost lane in the direction of travel)
    Visibility | 0/1 | Boolean | 1 = fully visible, 0 = partially visible in the camera frame

    • Columns marked with an asterisk (*) may be empty under certain conditions; see [1] for more details. A minimal loading sketch follows this note.
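    The sketch below shows one way such a session file could be loaded and a single vehicle's speed profile resampled; the file name follows the naming convention above, and the use of pandas is an assumption rather than part of the dataset.

     import pandas as pd

     # Sketch: load one flight-session CSV (name follows the convention above)
     # and look at a single vehicle. Column names are taken from the table above.
     df = pd.read_csv("2022-10-04_S_AM1.csv")

     # Local_Time is local Korean time in hh:mm:ss.sss; parse it for resampling.
     df["Local_Time"] = pd.to_datetime(df["Local_Time"], format="%H:%M:%S.%f")

     one_vehicle = df[df["Vehicle_ID"] == 1].set_index("Local_Time")

     # Downsample the ~29.97 Hz speed samples to 1-second means.
     print(one_vehicle["Vehicle_Speed"].resample("1s").mean().head())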
    2. Orthophoto Cut-Outs (orthophotos.zip, 1.8 GB)

    For each intersection, we provide the high-resolution orthophoto cut-outs that were used for georeferencing. These 8000×8000 pixel PNG images cover specific areas, allowing users to overlay orthophoto trajectories within the road network.

    orthophotos/
    │── A.png
    │── B.png
    │── …
    └── U.png

    For more details on the orthophoto generation process, refer to [1].

    3. Orthophoto Segmentations (segmentations.zip, 24.9 KB)

    We provide the road and lane segmentations for each orthophoto cut-out, stored as CSV files where each row defines a lane polygon within a road section.

    Each section (N_G) groups lanes moving in the same direction, with lanes numbered sequentially from the innermost outward. The CSV files are structured as follows:

    segmentations/
    │── A.csv
    │── B.csv
    │── …
    └── U.csv

    Each file contains the following columns:

    Section: Road section ID (N_G format).

    Lane: Lane number within the section.

    tlx, tly, blx, bly, brx, bry, trx, try: Polygon corner coordinates.

    These segmentations enabled trajectory points to be mapped to specific lanes and sections in our trajectory dataset. Vehicles outside segmented areas (e.g., intersection centers) remain unlabeled. Perspective distortions may also cause misalignments for taller vehicles.
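    As an illustration of that lane-mapping step, the sketch below tests whether an orthophoto-coordinate point falls inside one of the lane polygons; the point-in-polygon test via matplotlib is our own choice for the example and not necessarily the method used in [1].

     import pandas as pd
     from matplotlib.path import Path

     # Sketch: map an orthophoto point (x, y) to a lane using the segmentation
     # CSV for intersection 'A'. Column names follow the description above.
     lanes = pd.read_csv("A.csv")

     def find_lane(x, y):
         for _, row in lanes.iterrows():
             corners = [(row["tlx"], row["tly"]), (row["blx"], row["bly"]),
                        (row["brx"], row["bry"]), (row["trx"], row["try"])]
             if Path(corners).contains_point((x, y)):
                 return row["Section"], row["Lane"]
         return None  # e.g. intersection centers are not covered by any polygon

     print(find_lane(4805.0, 512.0))  # the coordinates here are placeholders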

    4. Sample Videos (sample_videos.zip, 26.8 GB)

    The dataset includes 29 video samples, each capturing the first 60 seconds of drone hovering over its designated intersection during the final session (PM5) on October 7, 2022. These high-resolution 4K videos provide additional context for trajectory analysis and visualization, complementing the orthophoto cut-outs and segmentations.

    sample_videos/
    │── A_D1_2022-10-07_PM5_60s.mp4
    │── A_D2_2022-10-07_PM5_60s.mp4
    │── B_D1_2022-10-07_PM5_60s.mp4
    │── …
    └── U_D10_2022-10-07_PM5_60s.mp4

    Additional Files

    README.md – Dataset documentation (this file)

    LICENSE.txt – Creative Commons Attribution 4.0 License

    Known Dataset Artifacts and Limitations

    While this dataset is designed for high accuracy, users should be aware of the following known artifacts and limitations:

    Trajectory Fragmentation: Trajectories may be fragmented for motorcycles in complex road infrastructure scenarios (pedestrian crossings, bicycle lanes, traffic signals) and for certain underrepresented truck variants. Additional fragmentations occurred when drones experienced technical issues during hovering, necessitating mid-recording splits that naturally resulted in divided trajectories.

    Vehicle ID Ambiguities: The largest Vehicle_ID in a CSV file does not necessarily indicate the total number of unique vehicles.

    Kinematic Estimation Limitations: Speed and acceleration values are derived from raw tracking data and may be affected by minor errors due to detection inaccuracies, stabilization artifacts, and applied interpolation and smoothing techniques.

    Vehicle Dimension Estimation: Estimates may be unreliable for stationary or non-axially moving vehicles and can be affected by bounding box overestimations capturing protruding vehicle parts or shadows.

    Lane and Section Assignment Inaccuracies: Perspective effects may cause vehicles with significant heights, such as trucks or buses, to be misassigned to incorrect lanes or sections in the orthophoto.

    Occasional pedestrian pair misclassifications: Rarely, two pedestrians walking side by side may be briefly mistaken for a motorcycle, but such instances are short-lived and typically removed by the short trajectory filter.

    For a comprehensive discussion of dataset limitations and validation procedures, refer to [1].

    Citation & Attribution

    Preferred Citation:

    If you use Songdo Traffic for any purpose, whether in academic research, commercial applications, open-source projects, or benchmarking efforts, please cite our accompanying manuscript [1]:

    Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, arXiv preprint arXiv:2411.02136.

  7. Data from: Change Point Detection in WLANs with Random AP Forests

    • figshare.com
    txt
    Updated Oct 9, 2023
    Cite
    Alexis Huet; Jonatan Krolikowski; Jose Manuel Navarro; Fuxing Chen; dario rossi (2023). Change Point Detection in WLANs with Random AP Forests [Dataset]. http://doi.org/10.6084/m9.figshare.23566146.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    figshare
    Authors
    Alexis Huet; Jonatan Krolikowski; Jose Manuel Navarro; Fuxing Chen; dario rossi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Artifacts from the ACM CoNEXT '23 paper "Change Point Detection in WLANs with Random AP Forests".

    Collection environment

    The collection campaign targeted a real 5GHz band WLAN and lasted over two months. The WLAN is a production network with thousands of daily users, located on a single floor of a large building equipped with 33 APs. The telemetry consists of a single AP-level Key Performance Indicator (KPI), collected every minute. In particular, the KPI in the dataset is the path loss measurement among pairs of APs and is aggregated in time (at a 15 minute granularity, corresponding to up to 96 samples per day) and space (for each undirected AP pair, a single average value of the path loss is reported). This information is contained in the rapf_kpis.csv file. A portion of the dataset is further annotated with ground truth concerning 9 relevant events that alter the signals, providing a benchmark to assess the validity of the change point detection. This information is contained in the rapf_labels.csv file.

    Artifacts

    The dataset comprises two csv files: rapf_kpis.csv (the path loss KPI time series) and rapf_labels.csv (the ground truth events). For both files, rows correspond to time, while the 529 columns correspond to: the timestamp (column 1), and the 528 distinct undirected pairs between the 33 APs, expressed with the format APx--APy (with x < y, as undirected KPIs are reported, for the remaining columns).

    Time series KPI data

    Each measured path loss value is collected for a certain AP pair over a 15 minute timeframe. During a period of two months (from Dec 02, 2021 to Feb 09, 2022), 2.2 million samples were collected among the 528 AP pairs. Missing values are introduced either by technical problems in the data collection (for 2 days among the 70 collection days), or by the inability of APs to collect path loss measurements of distant neighbors (roughly 35% of samples are missing in the dataset). The path loss values are expressed in dB and range from 43 to 119 (median and mean of 78). Missing values are expressed as NA. In more detail:

    • Collection from 2021-12-02T00:00:00Z to 2022-02-09T23:45:00Z (70 days) following ISO-8601 date representation. In particular, all the timestamps are UTC.
    • Among those 70 days, there are technical problems for two days (2021-12-23 and 2022-02-02) which are not present in the csv.
    • There are 96 timeslots in a day, so the number of rows (without the header) is (70-2)x96 = 6528.
    • Among those 6528 rows, there is one timestamp containing NA for all pairs (2021-12-18T07:30:00Z), while all others have at least one pair that is non-missing.

    Ground truth labels

    Each ground truth event contains information regarding the date of occurrence (timestamp) and the impacted pairs. As ground truth events are scarce, the timestamps that are not present in rapf_labels.csv do not contain any event for any AP pair. Overall, rapf_labels.csv contains 9 rows (one per event) and 529 columns (one per AP pair, plus the timestamp). A cell has either the value 0 (no change point for this pair at that timestamp) or 1 (a change point occurred at that timestamp for this pair). We provide a succinct description of the nine events as follows:

    • Event 1: Dec 06, 2021 at 17:45; all 33 APs impacted.
    • Event 2: Dec 07, 2021 at 16:00; 29 APs impacted (AP15, AP16, AP19, AP24 are not impacted, corresponding to 6 non-impacted AP pairs AP15--AP16, AP15--AP19, AP15--AP24, AP16--AP19, AP16--AP24, AP19--AP24).
    • Event 3: Dec 09, 2021 at 17:30; 1 AP impacted (AP24 is impacted, corresponding to 32 impacted AP pairs).
    • Event 4: Dec 14, 2021 at 16:45; 29 APs impacted (AP10, AP17, AP19, AP28 are not impacted, corresponding to 6 non-impacted AP pairs AP10--AP17, AP10--AP19, AP10--AP28, AP17--AP19, AP17--AP28, AP19--AP28).
    • Event 5: Dec 21, 2021 at 21:30; 29 APs impacted (AP6, AP11, AP19, AP20 are not impacted, corresponding to 6 non-impacted AP pairs AP6--AP11, AP6--AP19, AP6--AP20, AP11--AP19, AP11--AP20, AP19--AP20).
    • Event 6: Jan 04, 2022 at 16:45; all 33 APs impacted.
    • Event 7: Jan 11, 2022 at 18:15; all 33 APs impacted.
    • Event 8: Jan 18, 2022 at 16:30; all 33 APs impacted.
    • Event 9: Feb 04, 2022 at 16:30; all 33 APs impacted.
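    A minimal loading sketch is shown below; it assumes pandas and relies only on the column layout described above (a timestamp column followed by 528 APx--APy pair columns).

     import pandas as pd

     # Sketch: load the KPI time series and the ground-truth events. Both files
     # share the same layout: one timestamp column plus 528 "APx--APy" columns.
     kpis = pd.read_csv("rapf_kpis.csv", index_col=0, parse_dates=[0])
     labels = pd.read_csv("rapf_labels.csv", index_col=0, parse_dates=[0])

     print(kpis.shape)    # expected 6528 rows x 528 pair columns per the description
     print(labels.shape)  # expected 9 rows, one per ground-truth event

     # AP pairs affected by the first ground-truth event:
     first_event = labels.iloc[0]
     print(first_event[first_event == 1].index.tolist()[:5])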

  8. TBX11K Simplified - TB X-rays with bounding boxes

    • kaggle.com
    Updated Feb 8, 2023
    Cite
    vbookshelf (2023). TBX11K Simplified - TB X-rays with bounding boxes [Dataset]. https://www.kaggle.com/datasets/vbookshelf/tbx11k-simplified/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle
    Authors
    vbookshelf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The TBX11K dataset is a large dataset containing 11000 chest x-ray images. It's the only TB dataset that I know of that includes TB bounding boxes. This allows both classification and detection models to be trained.

    However, it can be mentally tiring to get started with this dataset. It includes many xml, json and txt files that you need to sift through to try to understand what everything means, how it all fits together and how to extract the bounding box coordinates.

    Here I've simplified the dataset. Now there's just one csv file, one folder containing the training images and one folder containing the test images.


    Paper: Rethinking Computer-aided Tuberculosis Diagnosis

    Original TBX11K dataset on Kaggle


    Notes

    1- Please start by reading the paper. It will help you understand what everything means.
    2- The original dataset was split into train and validation sets. This split is shown in the 'source' column in the data.csv file.
    3- The test images are stored in the folder called "test". There are no labels for these images and I've not included them in data.csv.
    4- Each bounding box is on a separate row. Therefore, the file names in the "fname" column are not unique. For example, if an image has two bounding boxes then the file name for that image will appear twice in the "fname" column.
    5- The original dataset has a folder named "extra" that contains data from other TB datasets. I've not included that folder here.
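    Because each bounding box sits on its own row, grouping data.csv by file name is a natural first step; a minimal pandas sketch is shown below (only the "fname" and "source" columns are named in the notes above, so any other column names would need to be checked against the file itself).

     import pandas as pd

     # Sketch: data.csv has one row per bounding box, so "fname" is not unique.
     # Group rows by image to recover all boxes that belong to each file.
     df = pd.read_csv("data.csv")

     boxes_per_image = df.groupby("fname").size()
     print(boxes_per_image.describe())  # how many boxes images typically have

     # All rows (i.e. all boxes) for one example image:
     print(df[df["fname"] == df["fname"].iloc[0]])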


    Acknowledgements

    Many thanks to the team that created the TBX11K dataset and generously made it publicly available.


    Citation

     # TBX11K dataset
     @inproceedings{liu2020rethinking,
      title={Rethinking computer-aided tuberculosis diagnosis},
      author={Liu, Yun and Wu, Yu-Huan and Ban, Yunfeng and Wang, Huifang and Cheng, Ming-Ming},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={2646--2655},
      year={2020}
     }
    
    


    Helpful Resources

  9. Gait Database

    • figshare.com
    zip
    Updated Jul 21, 2022
    Cite
    Nazli Rafei Dehkordi; saman farahmand (2022). Gait Database [Dataset]. http://doi.org/10.6084/m9.figshare.20346852.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 21, 2022
    Dataset provided by
    figshare
    Authors
    Nazli Rafei Dehkordi; saman farahmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gait recognition is the characterization of unique biometric patterns associated with each individual, which can be utilized to identify a person without direct contact. A public gait database with a relatively large number of subjects can provide a great opportunity for future studies to build and validate gait authentication models. The goal of this study is to introduce a comprehensive gait database of 93 human subjects who walked between two end points (320 meters) during two different sessions, with their gait data recorded using two smartphones, one attached to the right thigh and the other to the left side of the waist. This data was collected with the intention of being utilized by deep learning-based methods, which require enough time points. The metadata, including age, gender, smoking, daily exercise time, height, and weight of each individual, is recorded. This dataset is publicly available.

    Except for 19 subjects who did not attend the second session, every subject is associated with 4 different log files (each session contains two log files). Every file name has one of the following patterns:

    · sub0-lw-s1.csv: subject number 0, left waist, session 1
    · sub0-rp-s1.csv: subject number 0, right thigh, session 1
    · sub0-lw-s2.csv: subject number 0, left waist, session 2
    · sub0-rp-s2.csv: subject number 0, right thigh, session 2

    Every log file contains 58 features that are internally captured and calculated using the SensorLog app. Additionally, an Excel file containing the metadata is provided for each subject.
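    A small sketch for iterating over the per-subject log files is given below; the glob pattern follows the naming convention above, and pandas is assumed for reading the CSV logs.

     import glob
     import re
     import pandas as pd

     # Sketch: walk the log files named sub<N>-<lw|rp>-s<1|2>.csv as described
     # above and print basic shape information per recording.
     pattern = re.compile(r"sub(\d+)-(lw|rp)-s([12])\.csv")

     for path in sorted(glob.glob("sub*-*-s*.csv")):
         match = pattern.search(path)
         if match is None:
             continue
         subject, placement, session = match.groups()
         log = pd.read_csv(path)  # 58 SensorLog features per the description
         print(subject, placement, session, log.shape)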

  10. HiRISE Image Patches Obscured by Atmospheric Dust

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
Gary Doran (2020). HiRISE Image Patches Obscured by Atmospheric Dust [Dataset]. http://doi.org/10.5281/zenodo.3495068
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gary Doran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The purpose of this dataset is to train a classifier to detect "dusty" versus "not dusty" patches within browse-resolution HiRISE observations of the Martian surface. Here, "dusty" refers to images in which the view of the surface has been obscured heavily by atmospheric dust.

    The dataset contains two sets of 20,000 image patches each from EDR (full resolution) and RDR ("browse" resolution) non-map-projected ("nomap") HiRISE images, with balanced classes. The patches have been split into train (n = 10,000), validation (n = 5,000), and test (n = 5,000) sets such that no two patches from the same HiRISE observation appear in more than one of these subsets. There could be some noise in the labels, but a subset of the validation images have been manually vetted so that label noise rates can be estimated. More details on the dataset creation process are described below.

    Generating Candidate Images and Patches

    To begin constructing the dataset, the paper "The origin, evolution, and trajectory of large dust storms on Mars during Mars years 24–30 (1999–2011)," by Wang and Richardson (2015), was used to compile a set of time ranges for which global or regional dust storms were known to be occurring on Mars. All HiRISE RDR nomap browse images acquired within these time ranges were then inspected manually to determine sets of images that were (1) almost entirely obscured by dust and (2) almost entirely clear of dust. Then, 10,000 patches from the two subsets of images were extracted to form the "dusty" and "not dusty" classes. The extracted patches are 100-by-100 pixels, which roughly corresponds to the width of one CCD channel within the browse image (the width of the raw EDR data products that are stitched together to form a full RDR image). Some small amount of label noise is introduced in this process, since a patch from a mostly dusty image might happen to contain a clear view of the ground, and a patch from a mostly non-dusty image might contain some dust or regions on the surface that are featureless and appear like dusty patches. A set of "vetting labels" is included, which includes human annotations by the author for a subset of the validation set of patches. These labels can be used to estimate the apparent label noise in the dataset.

    Corresponding to the RDR patch dataset, a set of patches are extracted from the same set of EDR images for the "dusty" and "not dusty" classes. EDRs are raw images from the instrument that have not been calibrated or stitched together. To provide some form of normalization, EDR patches are only extracted from the lower half of the EDRs, with the upper half being used to perform a basic calibration of the lower half. Basic calibration is done by subtracting the sample (image column) averages from the upper half to remove "striping," then computing the 0.1th and 99.9th percentiles of the remaining values in the upper half and stretching the image patch to 8-bit integer values [0, 255] within that range. The calibration is meant to implement a process that could be performed onboard the spacecraft as the data is being observed (hence, using the top half of the image acquired first to calibrate the lower half of the image which is acquired later). The full resolution EDRs, which are 1024 pixels wide, are resized down to 100-by-100 pixel patches after being extracted so that they roughly match the resolution of the patches from the RDR browse images.
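    A rough numpy rendering of that calibration is sketched below; it is our reading of the description, not the author's code, and the array shape is illustrative.

     import numpy as np

     # Sketch of the EDR calibration as described above: use the upper half of
     # the image to de-stripe and stretch the lower half to 8-bit values.
     def calibrate_lower_half(edr):
         """edr: 2-D array of raw counts (rows x 1024 columns)."""
         upper, lower = np.array_split(edr, 2, axis=0)

         # Remove "striping" with per-column means estimated on the upper half.
         column_means = upper.mean(axis=0)
         upper = upper - column_means
         lower = lower - column_means

         # Stretch using the 0.1th / 99.9th percentiles of the upper half.
         lo, hi = np.percentile(upper, [0.1, 99.9])
         scaled = np.clip((lower - lo) / (hi - lo), 0.0, 1.0)
         return (scaled * 255).astype(np.uint8)

     demo = calibrate_lower_half(np.random.default_rng(0).normal(size=(2048, 1024)))
     print(demo.shape, demo.dtype)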

    Archive Contents

    The compressed archive file contains two top-level directories with similar contents, "edr_nomap_full_resized" and "rdr_nomap_browse." The first directory contains the dataset constructed from EDR data and the second contains the dataset constructed from RDR data.

    Within each directory, there are "dusty" and "not_dusty" directories containing the image patches from each class, "manifest.csv," and "vetting_labels.csv." The vetting labels file contains a list of manually labeled examples, along with the original labels to make it easier to compute label noise rates. The "manifest.csv" file contains a list of every example, its label, and whether it belongs to the train, validation, or test set.

    An example ID encodes information about where the patch was sampled from the original HiRISE image. As an example from the RDR dataset, the ID "003100_PSP_004440_2125_r4805_c512" can be broken into several parts:

    • "003100" is a unique numerical ID
    • "PSP_004440_2125" is the HiRISE observation ID
    • "r4805_c512" means the patch's upper left corner starts at the 4805th row and 512th column of the original observation

    For the EDR dataset, the ID "200000_PSP_004530_1030_RED7_1_r9153" is broken down as follows (a parsing sketch follows the list):

    • "200000" is a unique numerical ID
    • "PSP_004530_1030" is the HiRISE observation ID
    • "RED7" is the CCD ID
    • "1" is the CCD channel (either 0 or 1)
    • "r9153" means that the patch is extracted starting at the 9153rd row (since all columns of the 1024-pixel EDR are used, no column is specified; it is implicitly always 0)

    Original Data

    The original HiRISE EDR and RDR data is available via the Planetary Data System (PDS), hosted at https://hirise-pds.lpl.arizona.edu/PDS/

  11. THÖR-MAGNI: A Large-scale Indoor Motion Capture Recording of Human Movement...

    • zenodo.org
    zip
    Updated Feb 7, 2024
    Cite
    Tim Schreiter; Tiago Rodrigues de Almeida; Yufei Zhu; Eduardo Gutierrez Maestro; Andrey Rudenko (2024). THÖR-MAGNI: A Large-scale Indoor Motion Capture Recording of Human Movement and Interaction [Dataset]. http://doi.org/10.5281/zenodo.10554472
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Schreiter; Tiago Rodrigues de Almeida; Yufei Zhu; Eduardo Gutierrez Maestro; Andrey Rudenko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The THÖR-MAGNI Dataset Tutorials

    THÖR-MAGNI is a novel dataset of accurate human and robot navigation and interaction in diverse indoor contexts, building on the previous THÖR dataset protocol. We provide position and head orientation motion capture data, 3D LiDAR scans and gaze tracking. In total, THÖR-MAGNI captures 3.5 hours of motion of 40 participants on 5 recording days.

    This data collection is designed around systematic variation of factors in the environment to allow building cue-conditioned models of human motion and verifying hypotheses on factor impact. To that end, THÖR-MAGNI encompasses 5 scenarios, some of which have different conditions (i.e., we vary some factor):

    • Scenario 1 (plus conditions A and B):
      • Participants move in groups and individually;
      • Robot as static obstacle;
      • Environment with 3 obstacles and lane marking on the floor for condition B;
    • Scenario 2:
      • Participants move in groups, individually and transport objects with variable difficulty (i.e. bucket, boxes and a poster stand);
      • Robot as static obstacle;
      • Environment with 3 obstacles;
    • Scenario 3 (plus conditions A and B):
      • Participants move in groups, individually and transporting objects with variable difficulty (i.e. bucket, boxes and a poster stand). We denote each role as: Visitors-Alone, Visitors-Group 2, Visitors-Group 3, Carrier-Bucket, Carrier-Box, Carrier-Large Object;
      • Teleoperated robot as moving agent: in condition A, the robot moves with differential drive; in condition B, the robot moves with omni-directional drive;
      • Environment with 2 obstacles;
    • Scenario 4 (plus conditions A and B):
      • All participants, denoted as Visitors-Alone HRI, interacted with the teleoperated mobile robot;
      • Robot interacted in two ways: in condition A (Verbal-Only), the Anthropomorphic Robot Mock Driver (ARMoD), a small humanoid NAO robot on top of the mobile platform, only used speech to communicate the next goal point to the participant; in condition B the ARMoD used speech, gestures and robotic gaze to convey the same message;
      • Free space environment
    • Scenario 5:
      • Participants move alone (Visitors-Alone) and one of the participants, denoted as Visitors-Alone HRI, transports objects and interacts with the robot;
      • The ARMoD is remotely controlled by an experimenter and proactively offers help;
      • Free space environment;

    Preliminary steps

    Before proceeding, make sure to download the data from ZENODO

    1. Directory Structure

    ├── docs

    │ ├── tutorials.md <- Tutorials document on how to use the data

    ├── goals_positions.csv <- File with the goals locations

    ├── maps <- Directory for maps of the environment (PNG files) and offsets (json file)

    │ ├── offsets.json <- Offsets of the map with respect to the global coordinate frame origin

    │ ├── {date}_SC{sc_id}_map.png <- Maps for `date` in {1205, 1305, 1705, 1805} and `sc_id` in {1A, 1B, 2, 3}

    │ ├── 3009_map.png <- Map for the Scenarios 4A, 4B and 5

    ├── CSVs_Scenarios <- Directory for aligned data for all scenarios

    │ ├── Scenario_1 <- Directory for the CSV files for Scenario 1

    │ ├── Scenario_2 <- Directory for the CSV files for Scenario 2

    │ ├── Scenario_3 <- Directory for the CSV files for Scenario 3

    │ ├── Scenario_4 <- Directory for the CSV files for Scenario 4

    │ ├── Scenario_5 <- Directory for the CSV files for Scenario 5

    ├── TSVs_RAWET <- Directory for the TSV files for the Raw Eyetracking data for all Scenarios

    │ ├── synch_info.csv <- Event markers necessary to align motion capture with eyetracking data

    │ ├── Files <- Directory with all the raw eyetracking TSV files

    2. Data Structure and Dataset Files

    Within each Scenario directory, each csv file contains:

    2.1. Headers

    The dataset metadata overview contains important information found in the CSV file headers. This reference is designed to help users understand and use the dataset effectively. The headers include details such as FILE_ID, which provides information on the date, scenario, condition, and run associated with each recording. The header of the document includes important quantities such as the number of frames recorded (N_FRAMES_QTM), the count of rigid bodies (N_BODIES), and the total number of markers (N_MARKERS).

    It also provides information about the order of the contiguous rotation matrix (CONTIGUOUS_ROTATION_MATRIX), modalities measured with units, and specified measurement units. The text presents details on the eyetracking devices used in each recording, including their infrared sensor and scene camera frequencies, as well as an indication of the presence of eyetracking data.

    The header provides specific information about rigid bodies, including their names (BODY_NAMES), role labels (BODY_ROLES), and the number of markers associated with each rigid body (BODY_NR_MARKERS). Finally, the table lists all marker names used in the file.

    This metadata provides researchers and practitioners with essential guidance on recording information, data quantities, and specifics about rigid bodies and markers. It is a valuable resource for understanding and effectively using the dataset in the CSV files.

    2.2. Trajectory Data

    The remaining portion of the CSV file integrates merged data from the motion capture system and eye tracking devices, organized based on participants' helmet rigid bodies. Columns within the dataset include XYZ coordinates of all markers, spatial centroid coordinates, 6DOF orientation of the object's local coordinate frame, and if available eye tracking data, encompassing 2D/3D gaze coordinates, scene recording frame numbers, eye movement types, and IMU data.

    Missing data is denoted by "N/A" or an empty cell. Temporal indexing is facilitated by the "Time" or "Frame" column, indicating timestamps or frame numbers. The motion capture system records at 100Hz, Tobii Glasses at 50Hz (Raw); 25 Hz (Camera), and Pupil Glasses at 100Hz (Raw); 30 Hz (Camera). The dataset is structured around motion capture recordings, and for each rigid body, such as "Helmet_1," details per frame include XYZ coordinates of markers, centroid coordinates, and a 9-element rotational matrix describing helmet orientation.

    Header | Explanation
    Helmet_1 - 1 X | X-Coordinate of Marker Number 1
    Helmet_1 - 1 Y | Y-Coordinate of Marker Number 1
    Helmet_1 - 1 Z | Z-Coordinate of Marker Number 1
    Helmet_1 - [...] | Same for Marker 2 and 3 of Helmet_1
    Helmet_1 Centroid_X | X-Coordinate of the Centroid
    Helmet_1 Centroid_Y | Y-Coordinate of the Centroid
    Helmet_1 Centroid_Z | Z-Coordinate of the Centroid
    Helmet_1 R0 | 1st Element of the CONTIGUOUS_ROTATION_MATRIX
    Helmet_1 R[..] | Same for R1 - R7
    Helmet_1 R8 | 9th Element of the CONTIGUOUS_ROTATION_MATRIX
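    A cautious loading sketch is given below; the file name is a placeholder built from the directory structure above, the column names follow the table above, and the real QTM exports carry a metadata header block whose exact layout should be checked against the file itself.

     import pandas as pd

     # Sketch: pull one helmet's centroid trajectory out of a scenario CSV.
     # The path is a placeholder; header rows may need to be skipped depending
     # on how the metadata block described above is stored in the file.
     csv_path = "CSVs_Scenarios/Scenario_1/example_run.csv"
     df = pd.read_csv(csv_path, na_values=["N/A"])

     centroid_cols = ["Helmet_1 Centroid_X", "Helmet_1 Centroid_Y", "Helmet_1 Centroid_Z"]
     trajectory = df[centroid_cols].dropna()

     print(trajectory.head())  # XYZ centroid positions, recorded at 100 Hz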

    2.3. Eyetracking Data

    The eye tracking data in the dataset includes 16 participants, providing a comprehensive dataset of over 500 minutes of recorded data across the different activities and scenarios with three different eyetracking devices. Devices are denoted with a special "Tracker_ID" in the dataset, i.e.:

    Tracker ID | Eyetracking Device
    TB2 | Tobii 2 Glasses
    TB3 | Tobii 3 Glasses
    PPL | Pupil Invisible Glasses

    Gaze points are classified into fixations and saccades using the Tobii I-VT Attention filter, which is specifically optimized for dynamic scenarios, with a velocity threshold of 100°/s. Eyetracking device calibrations were systematically repeated after each 4-minute recording to account for natural variations in participants' eye shapes and to improve the gaze estimation. In addition, gaze estimation adjustments for the Pupil Invisible Glasses were made after each 4-minute recording to mitigate potential drifts. It's worth noting that the scene cameras of the eye tracking glasses had different fields of view. The scene camera of the Pupil Invisible Glasses had a 1088x1080 image with both horizontal (HFOV) and vertical (VFOV) opening angles of 80°, while the Tobii Glasses provided a 1920x1080 image with different opening angles for Tobii Glasses 3 (HFOV: 95°, VFOV: 63°) and Tobii Glasses 2 (HFOV: 82°, VFOV: 52°).

    NOTE: Videos are not part of the dataset, they will be made available in 2024

    For one participant, wearing the Tobii Glasses 3 and Helmet_6, the data would be denoted as:

    Header | Explanation
    Helmet_6 -

