100+ datasets found
  1. Data from: Is repairing speech errors an automatic or a controlled process? Insights from the relationship between error and repair probabilities in English and Spanish

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Sep 13, 2019
    Cite
    McCloskey, Nicholas; Martin, Clara D.; Nozari, Nazbanou (2019). Is repairing speech errors an automatic or a controlled process? Insights from the relationship between error and repair probabilities in English and Spanish [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000097521
    Dataset updated
    Sep 13, 2019
    Authors
    McCloskey, Nicholas; Martin, Clara D.; Nozari, Nazbanou
    Description

    Speakers can correct their speech errors, but the mechanisms behind repairs are still unclear. Some findings, such as the speed of repairs and speakers’ occasional unawareness of them, point to an automatic repair process. This paper reports a finding that challenges a purely automatic repair process. Specifically, we show that as error rate increases, so does the proportion of repairs. Twenty highly-proficient English-Spanish bilinguals described dynamic visual events in real time (e.g. “The blue bottle disappears behind the brown curtain”) in English and Spanish blocks. Both error rates and proportion of corrected errors were higher on (a) noun phrase (NP)2 vs. NP1, and (b) word1 (adjective in English and noun in Spanish) vs. word2 within the NP. These results show a consistent relationship between error and repair probabilities, disentangled from position, compatible with a model in which greater control is recruited in error-prone situations to enhance the effectiveness of repair.

  2. Replication data for: How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It

    • dataverse.harvard.edu
    Updated Jan 11, 2023
    Cite
    Gary King; Margaret Roberts (2023). Replication data for: How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It [Dataset]. http://doi.org/10.7910/DVN/26935
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gary King; Margaret Roberts
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    "Robust standard errors" are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, w ill still bias estimators of all but a few quantities of interest. Even though this message is well known to methodologists, it has failed to reach most applied researchers. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help applied researchers realize these gains via an alternative perspective that offers a productive way to use robust standard errors; a new general and easier-to-use "generalized information matrix test" statistic; and practical illustrations via simulations and real examples from published research. Instead of jettisoning this extremely popular tool, as some suggest, we show how robust and classical standard error differences can provide effective clues about model misspecification, likely biases, and a guide to more reliable inferences. See also: Unifying Statistical Analysis

  3. Semi-supervised data cleaning

    • resodate.org
    Updated Dec 4, 2020
    Cite
    Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Mohammad Mahdavi Lahijani
    Description

    Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values. The goal of error correction is to fix these wrong values. Data cleaning is a challenging task due to the trade-off among correctness, completeness, and automation. In fact, detecting/correcting all data errors accurately without any user involvement is not possible for every dataset. We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner. In fact, the approach iteratively asks the user to annotate a tuple, i.e., marking/fixing a few data errors. The approach learns to generalize the user-provided error detection/correction examples to the rest of the dataset, accordingly. Our novel two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters. In fact, the approach considers the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task as its first and second steps maximize recall and precision, respectively. Third, the approach also minimizes human involvement as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation of our approach allows us to leverage previous data cleaning efforts to optimize the current data cleaning task. We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, our pipeline consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.
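
    The two-step formulation described above can be illustrated with a small, generic sketch (Python/scikit-learn). This is not the Raha or Baran implementation; the base detectors, features, and sampling below are simplified stand-ins meant only to show how the outputs of independent base detectors plus a few user labels can be combined into final error predictions.

    ```python
    # Conceptual sketch of ensembling base error detectors with a few user labels.
    # Not the Raha/Baran code -- detectors and features are simplified stand-ins.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({"zip": ["10115", "1O115", "10969", "abcde"],
                       "city": ["Berlin", "Berlin", "berlin", "Berlin"]})

    # Step 1: each base detector independently marks rows as potentially erroneous (1) or clean (0).
    def detector_non_numeric_zip(frame):
        return (~frame["zip"].str.isdigit()).astype(int)

    def detector_lowercase_city(frame):
        return frame["city"].str.islower().astype(int)

    base_outputs = np.column_stack([detector_non_numeric_zip(df), detector_lowercase_city(df)])

    # Step 2: a few user-annotated tuples (1 = contains an error) are generalized to the rest.
    labeled_idx = [0, 1]       # tuples the user was asked to annotate
    user_labels = [0, 1]       # user marks tuple 0 as clean, tuple 1 as erroneous

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(base_outputs[labeled_idx], user_labels)
    predicted_errors = clf.predict(base_outputs)   # final ensembled error predictions
    print(predicted_errors)
    ```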

  4. Data from: Performance and accuracy of lightweight and low-cost GPS data loggers according to antenna positions, fix intervals, habitats and animal movements

    • search.dataone.org
    • datasetcatalog.nlm.nih.gov
    • +3more
    Updated Jun 27, 2025
    Cite
    Marie-Amélie Forin-Wiart; Pauline Hubert; Pascal Sirguey; Marie-Lazarine Poulle (2025). Performance and accuracy of lightweight and low-cost GPS data loggers according to antenna positions, fix intervals, habitats and animal movements [Dataset]. http://doi.org/10.5061/dryad.7nm7b
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marie-Amélie Forin-Wiart; Pauline Hubert; Pascal Sirguey; Marie-Lazarine Poulle
    Time period covered
    Jan 1, 2016
    Description

    Recently developed low-cost Global Positioning System (GPS) data loggers are promising tools for wildlife research because of their affordability for low-budget projects and ability to simultaneously track a greater number of individuals compared with expensive built-in wildlife GPS. However, the reliability of these devices must be carefully examined because they were not developed to track wildlife. This study aimed to assess the performance and accuracy of commercially available GPS data loggers for the first time using the same methods applied to test built-in wildlife GPS. The effects of antenna position, fix interval and habitat on the fix-success rate (FSR) and location error (LE) of CatLog data loggers were investigated in stationary tests, whereas the effects of animal movements on these errors were investigated in motion tests. The units operated well and presented consistent performance and accuracy over time in stationary tests, and the FSR was good for all antenna positions...

  5. Sberbank Russian Housing Market Data Fix

    • kaggle.com
    zip
    Updated May 7, 2017
    Cite
    Matthew Anderson (2017). Sberbank Russian Housing Market Data Fix [Dataset]. https://www.kaggle.com/matthewa313/sberbankdatafix
    Available download formats: zip (17180267 bytes)
    Dataset updated
    May 7, 2017
    Authors
    Matthew Anderson
    Area covered
    Russia
    Description

    Upon reviewing the train data for the Sberbank Russian Housing Market competition, I noticed noise & errors. Obviously, neither of these should be present in your training set, and as such, you should remove them. This is the updated train set with all noise & errors I found removed.

    Data was removed when:

    - full_sq - life_sq < 0
    - full_sq - kitch_sq < 0
    - life_sq - kitch_sq < 0
    - floor - max_floor < 0

    I simply deleted those rows from the dataset; nothing else was changed.
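
    A minimal sketch (Python/pandas) of the row-removal rule described above, assuming the original train.csv from the competition; the column names follow the conditions listed, and the file paths are illustrative.

    ```python
    # Minimal sketch of the described cleaning rule: drop rows with impossible area/floor values.
    # Assumes the original Sberbank train.csv; file paths are illustrative.
    import pandas as pd

    train = pd.read_csv("train.csv")

    bad = (
        (train["full_sq"] - train["life_sq"] < 0)
        | (train["full_sq"] - train["kitch_sq"] < 0)
        | (train["life_sq"] - train["kitch_sq"] < 0)
        | (train["floor"] - train["max_floor"] < 0)
    )
    clean = train.loc[~bad]           # rows violating any condition are simply deleted
    clean.to_csv("train_fixed.csv", index=False)
    ```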

  6. Data from: autorepair

    • huggingface.co
    Updated Aug 11, 2024
    Cite
    Automatic Program Comprehension Lab (2024). autorepair [Dataset]. https://huggingface.co/datasets/apcl/autorepair
    Dataset updated
    Aug 11, 2024
    Dataset authored and provided by
    Automatic Program Comprehension Lab
    Description

    A Lossless Syntax Tree Generator with Zero-shot Error Correction

    This repository includes all of the datasets to reproduce the results in the paper and the srcml files that we generated. We follow Jam's procedure to compile the dataset for pretraining and finetuning.

    Dataset files

    Filename: Description
    bin.tar.gz: bin files to finetune the model to fix the syntactic error
    fundats.tar.gz: data files to generate srcml with the error correction in the zero-shot… See the full description on the dataset page: https://huggingface.co/datasets/apcl/autorepair.

  7. Comparison between fix success rate (FSR) ± standard deviation and root mean...

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Mariano R. Recio; Renaud Mathieu; Paul Denys; Pascal Sirguey; Philip J. Seddon (2023). Comparison between fix success rate (FSR) ± standard deviation and root mean square of location errors (LERMS), mean location errors (µLE) ± standard deviation and median (µ1/2LE) obtained from analysis of data collected at stationary tests (N = 60) under different habitats, vegetation configuration and sky availability. [Dataset]. http://doi.org/10.1371/journal.pone.0028225.t002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mariano R. Recio; Renaud Mathieu; Paul Denys; Pascal Sirguey; Philip J. Seddon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outliers correspond to fixes with location error (LE) > 3 standard deviations from the mean location error of all fixes in the same habitat (i.e., without regard to the visibility category). The last two columns report on the mean number of outliers ± standard deviation across each visibility, and LERMS values calculated from all fixes in the same habitat after removal of outlier values.
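
    A minimal sketch (Python/pandas) of the outlier rule stated above: flag fixes whose location error exceeds the habitat mean by more than three standard deviations, then recompute LERMS per habitat without them. The column and file names ("habitat", "LE", "stationary_fixes.csv") are assumptions for illustration.

    ```python
    # Minimal sketch of the stated outlier rule: LE more than 3 SD above the habitat mean.
    # Column names "habitat" and "LE" and the file name are assumed for illustration.
    import pandas as pd

    fixes = pd.read_csv("stationary_fixes.csv")

    grouped = fixes.groupby("habitat")["LE"]
    threshold = grouped.transform("mean") + 3 * grouped.transform("std")
    fixes["is_outlier"] = fixes["LE"] > threshold

    # LE root mean square per habitat after outlier removal
    lerms = (fixes.loc[~fixes["is_outlier"]]
                  .groupby("habitat")["LE"]
                  .apply(lambda le: (le ** 2).mean() ** 0.5))
    print(lerms)
    ```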

  8. Fix: Employee Payroll Data (FMPS Payroll Costing) - 7/10/2025

    • catalog.data.gov
    Updated Jul 12, 2025
    Cite
    data.cityofchicago.org (2025). Fix: Employee Payroll Data (FMPS Payroll Costing) - 7/10/2025 [Dataset]. https://catalog.data.gov/dataset/fix-employee-payroll-data-fmps-payroll-costing-7-10-2025
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    data.cityofchicago.org
    Description

    Reload to correct some errors.

  9. Results from stationary unit tests performed with 40 low-cost CatLog GPS...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 18, 2015
    Cite
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline (2015). Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected from for two antenna positions, three fix intervals programs and four habitat types. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001919478
    Dataset updated
    Jun 18, 2015
    Authors
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline
    Description

    Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected for two antenna positions, three fix interval programs and four habitat types.

  10. 200 Annotated Developer Human Errors from GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 4, 2024
    Cite
    Meyers, Benjamin; Meneely, Andrew (2024). 200 Annotated Developer Human Errors from GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10080448
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Rochester Institute of Technology
    Authors
    Meyers, Benjamin; Meneely, Andrew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineers' Human Errors

    This dataset contains 200 GitHub comments with manual human error annotations, released as part of the following publication:

    Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.

    Included Files

    The "developer_human_errors.csv" file contains the full dataset of 200 software defect descriptions annotated with human error types (slips, lapses, mistakes) and T.H.E.S.E. categories.

    CSV Fields

    ID: Unique identifier for the comment.

    SOURCE: Whether this comment originates from a commit, issue, or pull request.

    COMMENT_URL: The URL linking to the comment.

    COMMENT_TEXT: The raw comment text.

    HUMAN_ERROR_TYPE: Whether the software defect described is a slip, lapse, or mistake.

    THESE_V4_ID: Manually assigned T.H.E.S.E. category with labels corresponding to Version 4 of T.H.E.S.E.

    THESE_NAME: Name corresponding to manually assigned T.H.E.S.E. category.
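
    For orientation, a minimal sketch (Python/pandas) of loading the CSV and tallying the annotations by the fields listed above; only the filename and column names given in this description are assumed.

    ```python
    # Minimal sketch: load the annotations and tally them by error type and T.H.E.S.E. category.
    import pandas as pd

    errors = pd.read_csv("developer_human_errors.csv")

    # Distribution of slips / lapses / mistakes
    print(errors["HUMAN_ERROR_TYPE"].value_counts())

    # Distribution of T.H.E.S.E. categories, with their names
    print(errors.groupby(["THESE_V4_ID", "THESE_NAME"]).size().sort_values(ascending=False))
    ```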

    Annotation Details

    Human error types span slips, lapses, and mistakes from James Reason's Generic Error Modelling System (GEMS):

    Slips: Failures of attention.

    Lapses: Failures of memory.

    Mistakes: Failures of planning.

    T.H.E.S.E. categories are summarized below:

    S01: Typos & Misspellings

    S02: Syntax Errors

    S03: Overlooking Documented Information

    S04: Multitasking Errors

    S05: Hardware Interaction Errors

    S06: Overlooking Proposed Code Changes

    S07: Overlooking Existing Functionality

    S08: General Attentional Failure

    L01: Forgetting to Finish a Development Task

    L02: Forgetting to Fix a Defect

    L03: Forgetting to Remove Development Artifacts

    L04: Working with Outdated Source Code

    L05: Forgetting an Import Statement

    L06: Forgetting to Save Work

    L07: Forgetting Previous Development Discussion

    L08: General Memory Failure

    M01: Code Logic Errors

    M02: Incomplete Domain Knowledge

    M03: Wrong Assumption Errors

    M04: Internal Communication Errors

    M05: External Communication Errors

    M06: Solution Choice Errors

    M07: Time Management Errors

    M08: Inadequate Testing

    M09: Incorrect/Insufficient Configuration

    M10: Code Complexity Errors

    M11: Internationalization/String Encoding Errors

    M12: Inadequate Experience Errors

    M13: Insufficient Tooling Access Errors

    M14: Workflow Order Errors

    M15: General Planning Failure

    Contact

    Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

    Acknowledgments

    Collection of this data has been sponsored in part by the National Science Foundation (grant 1922169), by the NSA Science of Security Lablet program (grant H98230-17-D-0080/2018-0438-02), and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).

  11. British Job Agency Employment

    • kaggle.com
    zip
    Updated Jul 27, 2018
    Cite
    Rahul (2018). British Job Agency Employment [Dataset]. https://www.kaggle.com/rahul025/error-detection
    Available download formats: zip (963570 bytes)
    Dataset updated
    Jul 27, 2018
    Authors
    Rahul
    Area covered
    United Kingdom
    Description

    Auditing and Cleansing the Job dataset

    The dataset description is shown below:

    Columns and its Description

    Id : 8 digit Id of the job advertisement,

    Title: Title of the advertised job position,

    Location: Location of the advertised job position,

    ContractType: The contract type of the advertised job position, could be full-time, part-time or non-specified,

    ContractTime: The contract time of the advertised job position, could be permanent, contract or non-specified,

    Company: Company (employer) of the advertised job position,

    Category: The Category of the advertised job position, e.g., IT jobs, Engineering Jobs, etc.

    Salary per annum: Annual Salary of the advertised job position, e.g., 80000,

    OpenDate: The opening time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012,

    CloseDate: The closing time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012,

    SourceName: The website where the job position is advertised.

    In this task, you are required to inspect and audit the data (dataset1_with_error.csv) to identify the data problems, and then fix them. The generic and major data problems that might be found in the data include:

    - Lexical errors
    - Irregularities
    - Violations of integrity constraints
    - Inconsistency

    In the end, save the error-free dataset as dataset1_solution.csv. The number of records in your solution should be the same as the number of records in the input file.
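
    A minimal sketch (Python/pandas) of how such an audit could start, using the columns described above. The specific checks shown (ID format, date parsing, open-before-close constraint) are illustrative, not a complete solution.

    ```python
    # Minimal sketch of auditing dataset1_with_error.csv; checks are illustrative, not exhaustive.
    import pandas as pd

    jobs = pd.read_csv("dataset1_with_error.csv", dtype=str)

    # Lexical errors: Id should be exactly 8 digits.
    bad_ids = jobs[~jobs["Id"].str.fullmatch(r"\d{8}", na=False)]

    # Irregularities: dates should parse with the documented format, e.g. 20120104T150000.
    open_dt = pd.to_datetime(jobs["OpenDate"], format="%Y%m%dT%H%M%S", errors="coerce")
    close_dt = pd.to_datetime(jobs["CloseDate"], format="%Y%m%dT%H%M%S", errors="coerce")
    bad_dates = jobs[open_dt.isna() | close_dt.isna()]

    # Integrity constraint: an advertisement must open before it closes.
    violations = jobs[open_dt >= close_dt]

    print(len(bad_ids), len(bad_dates), len(violations))
    # After fixing the problems, write the result with the same number of records:
    # jobs.to_csv("dataset1_solution.csv", index=False)
    ```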

  12. 10 Years Bug-Fix Dataset (PROMISE'19)

    • figshare.com
    • search.datacite.org
    zip
    Updated Sep 27, 2021
    Cite
    Renan Vieira (2021). 10 Years Bug-Fix Dataset (PROMISE'19) [Dataset]. http://doi.org/10.6084/m9.figshare.8852084.v5
    Available download formats: zip
    Dataset updated
    Sep 27, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Renan Vieira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package of the paper "From Reports to Bug-Fix Commits: A 10 Years Dataset of Bug-Fixing Activity from 55 Apache's Open Source Projects"

    ABSTRACT: Bugs appear in almost any software development. Solving all or at least a large part of them requires a great deal of time, effort, and budget. Software projects typically use issue tracking systems as a way to report and monitor bug-fixing tasks. In recent years, several researchers have been conducting bug tracking analysis to better understand the problem and thus provide means to reduce costs and improve the efficiency of the bug-fixing task. In this paper, we introduce a new dataset composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation, distributed in 9 categories. We have mined this information from the Jira issue tracking system concerning two different perspectives of reports with closed/resolved status: static (the latest version of reports) and dynamic (the changes that have occurred in reports over time). We also extract information from the commits (if they exist) that fix such bugs from their respective version-control system (Git). We also provide a change analysis that occurs in the reports as a way of illustrating and characterizing the proposed dataset. Since the data extraction process is an error-prone, nontrivial task, we believe initiatives like this could be useful to support researchers in further, more detailed investigations.

    You can find the full paper at: https://doi.org/10.1145/3345629.3345639

    If you use this dataset for your research, please reference the following paper:

    @inproceedings{Vieira:2019:RBC:3345629.3345639,
      author = {Vieira, Renan and da Silva, Ant\^{o}nio and Rocha, Lincoln and Gomes, Jo\~{a}o Paulo},
      title = {From Reports to Bug-Fix Commits: A 10 Years Dataset of Bug-Fixing Activity from 55 Apache's Open Source Projects},
      booktitle = {Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering},
      series = {PROMISE'19},
      year = {2019},
      isbn = {978-1-4503-7233-6},
      location = {Recife, Brazil},
      pages = {80--89},
      numpages = {10},
      url = {http://doi.acm.org/10.1145/3345629.3345639},
      doi = {10.1145/3345629.3345639},
      acmid = {3345639},
      publisher = {ACM},
      address = {New York, NY, USA},
      keywords = {Bug-Fix Dataset, Mining Software Repositories, Software Traceability},
    }

    P.S.: We added a new dataset version (v1.0.1). In this version, we fixed the git commit features that track the src and test files. More info can be found in the fix-script.py file.

  13. Define Best Tariff for a Telecom Company

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Roman Nikiforov (2024). Define Best Tariff for a Telecom Company [Dataset]. https://www.kaggle.com/datasets/romanniki/prospective-tariff-for-a-telecom-company
    Available download formats: zip (3456315 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Roman Nikiforov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Determining the Prospective Tariff for a Telecom Company

    Project Description

    You are an analyst at "Megaline," a federal mobile operator. The company offers two tariff plans to customers: "Smart" and "Ultra." To adjust the advertising budget, the commercial department wants to understand which tariff generates more revenue.

    You need to conduct a preliminary analysis of the tariffs on a small sample of customers. You have data on 500 users of "Megaline": who they are, where they are from, which tariff they use, how many calls and messages they sent in 2018. You need to analyze customer behavior and conclude which tariff is better.

    Tariff Descriptions

    "Smart" Tariff: - Monthly fee: 550 rubles - Included: 500 minutes of calls, 50 messages, and 15 GB of internet traffic - Cost of services beyond the tariff package: 1. Call minute: 3 rubles (Megaline always rounds up minutes and megabytes. If the user talked for just 1 second, it counts as a whole minute); 2. Message: 3 rubles; 3. 1 GB of internet traffic: 200 rubles.

    "Ultra" Tariff: - Monthly fee: 1950 rubles - Included: 3000 minutes of calls, 1000 messages, and 30 GB of internet traffic - Cost of services beyond the tariff package: 1. Call minute: 1 ruble; 2. Message: 1 ruble; 3. 1 GB of internet traffic: 150 rubles.

    Note: Megaline always rounds up seconds to minutes and megabytes to gigabytes. Each call is rounded up individually: even if it lasted just 1 second, it is counted as 1 minute. For web traffic, separate sessions are not counted. Instead, the total amount for the month is rounded up. If a subscriber uses 1025 megabytes in a month, they are charged for 2 gigabytes.

    Project Steps

    Step 1: Open the files with data and study the general information. File paths:
    - /datasets/calls.csv
    - /datasets/internet.csv
    - /datasets/messages.csv
    - /datasets/tariffs.csv
    - /datasets/users.csv

    Step 2: Prepare the data
    - Convert data to the required types;
    - Find and fix errors in the data, if any. Explain what errors you found and how you fixed them. You will find calls with zero duration in the data. This is not an error: missed calls are indicated by zeros, so they do not need to be deleted.

    For each user, calculate:
    - Number of calls made and minutes spent per month;
    - Number of messages sent per month;
    - Amount of internet traffic used per month;
    - Monthly revenue from each user (subtract the free limit from the total number of calls, messages, and internet traffic; multiply the remainder by the value from the tariff plan; add the corresponding tariff plan's subscription fee), as sketched below.
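
    A minimal sketch (Python) of the monthly revenue rule, using the tariff parameters and rounding rules given in this description; the function, constants, and example values are illustrative and not part of the dataset.

    ```python
    # Minimal sketch of the monthly revenue rule described above, using the tariff
    # parameters from the project description; per-user aggregation is assumed done already.
    import math

    TARIFFS = {
        "smart": {"fee": 550, "minutes": 500, "messages": 50, "gb": 15,
                  "rub_per_min": 3, "rub_per_msg": 3, "rub_per_gb": 200},
        "ultra": {"fee": 1950, "minutes": 3000, "messages": 1000, "gb": 30,
                  "rub_per_min": 1, "rub_per_msg": 1, "rub_per_gb": 150},
    }

    def monthly_revenue(tariff_name, call_durations, messages, mb_used):
        """call_durations: per-call durations in minutes (each call rounded up individually);
        mb_used: total monthly traffic in MB (monthly total rounded up to whole GB)."""
        t = TARIFFS[tariff_name]
        minutes = sum(math.ceil(d) for d in call_durations)   # each call rounded up separately
        gb = math.ceil(mb_used / 1024)                        # monthly total rounded up to GB
        extra = (max(0, minutes - t["minutes"]) * t["rub_per_min"]
                 + max(0, messages - t["messages"]) * t["rub_per_msg"]
                 + max(0, gb - t["gb"]) * t["rub_per_gb"])
        return t["fee"] + extra

    # A 0.02-minute call counts as a whole minute; 1025 MB in a month is charged as 2 GB.
    print(monthly_revenue("smart", [0.02, 10.5], 60, 1025))
    ```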

    Step 3: Analyze the data. Describe the behavior of the operator's customers based on the sample. How many minutes of calls, how many messages, and how much internet traffic do users of each tariff need per month? Calculate the average, variance, and standard deviation. Create histograms. Describe the distributions.

    Step 4: Test hypotheses
    - The average revenue of users of the "Ultra" and "Smart" tariffs is different;
    - The average revenue of users from Moscow differs from the revenue of users from other regions. Moscow is written as 'Москва'; use this value when checking the hypothesis.

    Set the threshold alpha value yourself.

    Explain:
    - How you formulated the null and alternative hypotheses;
    - Which criterion you used to test the hypotheses and why.
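
    A minimal sketch (Python/scipy) of how the Step 4 comparison could be set up with a two-sample t-test, assuming per-user monthly revenue has already been computed in Step 2; the revenue values, test choice, and alpha below are illustrative only.

    ```python
    # Minimal sketch of testing "average revenue differs between Ultra and Smart users".
    # Assumes per-user monthly revenues were computed in Step 2; values here are placeholders.
    from scipy import stats

    smart_revenue = [550, 580, 610, 550, 760]       # placeholder per-user revenues, rubles
    ultra_revenue = [1950, 1950, 2100, 1950, 2250]

    alpha = 0.05                                     # threshold chosen by the analyst
    # H0: mean revenues are equal; H1: they differ. equal_var=False -> Welch's t-test.
    t_stat, p_value = stats.ttest_ind(smart_revenue, ultra_revenue, equal_var=False)

    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    print("reject H0" if p_value < alpha else "fail to reject H0")
    ```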

    Step 5: Write a general conclusion

    Formatting: Perform the task in a Jupyter Notebook. Put program code in code cells and textual explanations in markdown cells. Apply formatting and headers.

    Data Description

    Table users (user information):
    - user_id: unique user identifier
    - first_name: user's first name
    - last_name: user's last name
    - age: user's age (years)
    - reg_date: date of tariff connection (day, month, year)
    - churn_date: date of tariff discontinuation (if the value is missing, the tariff was still active at the time of data extraction)
    - city: user's city of residence
    - tariff: name of the tariff plan

    Table calls (call information):
    - id: unique call number
    - call_date: call date
    - duration: call duration in minutes
    - user_id: identifier of the user who made the call

    Table messages (message information):
    - id: unique message number
    - message_date: message date
    - user_id: identifier of the user who sent the message

    Table internet (internet session information):
    - id: unique session number
    - mb_used: amount of internet traffic used during the session (in megabytes)
    - session_date: internet session date
    - user_id: user identifier

    Table tariffs (tariff information):
    - tariff_name: tariff name
    - rub_monthly_fee: monthly subscription fee in rubles
    - minutes_included: number of call minutes included per month
    - `messages_included...

  14. Data from: Development of Crime Forecasting and Mapping Systems for Use by Police in Pittsburgh, Pennsylvania, and Rochester, New York, 1990-2001

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Nov 14, 2025
    Cite
    National Institute of Justice (2025). Development of Crime Forecasting and Mapping Systems for Use by Police in Pittsburgh, Pennsylvania, and Rochester, New York, 1990-2001 [Dataset]. https://catalog.data.gov/dataset/development-of-crime-forecasting-and-mapping-systems-for-use-by-police-in-pittsburgh-1990--09e19
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    Rochester, Pittsburgh, Pennsylvania
    Description

    This study was designed to develop crime forecasting as an application area for police in support of tactical deployment of resources. Data on crime offense reports and computer aided dispatch (CAD) drug calls and shots fired calls were collected from the Pittsburgh, Pennsylvania Bureau of Police for the years 1990 through 2001. Data on crime offense reports were collected from the Rochester, New York Police Department from January 1991 through December 2001. The Rochester CAD drug calls and shots fired calls were collected from January 1993 through May 2001. A total of 1,643,828 records (769,293 crime offense and 874,535 CAD) were collected from Pittsburgh, while 538,893 records (530,050 crime offense and 8,843 CAD) were collected from Rochester. ArcView 3.3 and GDT Dynamap 2000 Street centerline maps were used to address match the data, with some of the Pittsburgh data being cleaned to fix obvious errors and increase address match percentages. A SAS program was used to eliminate duplicate CAD calls based on time and location of the calls. For the 1990 through 1999 Pittsburgh crime offense data, the address match rate was 91 percent. The match rate for the 2000 through 2001 Pittsburgh crime offense data was 72 percent. The Pittsburgh CAD data address match rate for 1990 through 1999 was 85 percent, while for 2000 through 2001 the match rate was 100 percent because the new CAD system supplied incident coordinates. The address match rates for the Rochester crime offenses data was 96 percent, and 95 percent for the CAD data. Spatial overlay in ArcView was used to add geographic area identifiers for each data point: precinct, car beat, car beat plus, and 1990 Census tract. The crimes included for both Pittsburgh and Rochester were aggravated assault, arson, burglary, criminal mischief, misconduct, family violence, gambling, larceny, liquor law violations, motor vehicle theft, murder/manslaughter, prostitution, public drunkenness, rape, robbery, simple assaults, trespassing, vandalism, weapons, CAD drugs, and CAD shots fired.

  15. 🌍World Cities Population - cleaned version 🌍

    • kaggle.com
    zip
    Updated Oct 12, 2022
    Cite
    Donato Riccio (2022). 🌍World Cities Population - cleaned version 🌍 [Dataset]. https://www.kaggle.com/datasets/donatoriccio/world-cities-population-cleaned-version
    Available download formats: zip (7645992 bytes)
    Dataset updated
    Oct 12, 2022
    Authors
    Donato Riccio
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    About Dataset

    All cities with a population or seat of an administrative division (ca. 80,000).

    Sources and Contributions
    - Sources: GeoNames is aggregating over a hundred different data sources.
    - Ambassadors: GeoNames Ambassadors help in many countries.
    - Wiki: A wiki allows you to view the data, quickly fix errors, and add missing places.
    - Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: added country name.

    Context

    Columns: Name, Country Code, Country Name, Timezone, Population, Latitude, Longitude

    Acknowledgments
    These data come from Maxmind.com and have not been altered. The original source can be found by clicking here.

    Additionally:
    Reference: https://download.geonames.org/export/dump/
    Attributions: https://www.geonames.org/about.html

  16. Study Interventions -- See ALERT below

    • catalog.data.gov
    Updated Nov 2, 2025
    Cite
    National Center for PTSD (2025). Study Interventions -- See ALERT below [Dataset]. https://catalog.data.gov/dataset/study-interventions-78076
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    National Center for PTSD
    Description

    ALERT: As of 10/15/2025, we are working to resolve a data error in treatment completion variables (percent and detail). We expect a resolution by 10/31/2025, at which point downloading the revised data is advised.

    The Study Interventions dataset includes information about each of the specific treatment arms that were studied in all RCTs. Each study arm was coded to indicate the type of intervention or comparison condition. This dataset includes the study-level Study Class as well as individual variables for each category of treatment, coded as Yes or No for each arm. Study arm treatment category variables are as follows: Pharmacotherapy (as well as a subclass such as antidepressant, antianxiety, etc.); Psychotherapy (as well as a subclass to identify trauma-focused or non-trauma-focused therapy); Complementary and Integrative Health (CIH; as well as a subclass such as relaxation or meditation); Nonpharmacologic Biological; Nonpharmacologic Cognitive; Collaborative Care; Other Treatments; Control.

    The Study Intervention dataset also includes information on the format of the treatment (individual, group, couples, mixed); treatment delivery method (in person, by phone, by video, technology alone, technology assisted, written, or mixed); dose or amount of treatment; and treatment completion and adherence. Use this dataset to learn about treatment studies of a particular type. Each record is an arm of the study, labeled as A, B, C, or D. Values abstracted as not applicable ("NA") or not reported ("NR") from the study are null values (empty cells).

  17. Transactional Retail Dataset of Electronics Store

    • kaggle.com
    zip
    Updated Jul 20, 2021
    Cite
    Shahrayar (2021). Transactional Retail Dataset of Electronics Store [Dataset]. https://www.kaggle.com/muhammadshahrayar/transactional-retail-dataset-of-electronics-store
    Available download formats: zip (100952 bytes)
    Dataset updated
    Jul 20, 2021
    Authors
    Shahrayar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains information about an online electronic store. The store has three warehouses from which goods are delivered to customers.

    Columns Description

    • order_id: A unique id for each order
    • customer_id: A unique id for each customer
    • date: The date the order was made, given in YYYY-MM-DD format
    • nearest_warehouse: A string denoting the name of the nearest warehouse to the customer
    • shopping_cart: A list of tuples representing the order items: the first element of the tuple is the item ordered, and the second element is the quantity ordered for such item.
    • order_price: A float denoting the order price in USD. The order price is the price of items before any discounts and/or delivery charges are applied.
    • delivery_charges: A float representing the delivery charges of the order
    • customer_lat: Latitude of the customer’s location
    • customer_long: Longitude of the customer’s location
    • coupon_discount: An integer denoting the percentage discount to be applied to the order_price.
    • order_total: A float denoting the total of the order in USD after all discounts and/or delivery charges are applied.
    • season: A string denoting the season in which the order was placed.
    • is_expedited_delivery: A boolean denoting whether the customer has requested an expedited delivery
    • distance_to_nearest_warehouse: A float representing the arc distance, in kilometres, between the customer and the nearest warehouse to him/her.
    • latest_customer_review: A string representing the latest customer review on his/her most recent order
    • is_happy_customer: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.

    Inspiration

    Use this dataset to perform graphical and/or non-graphical EDA methods to understand the data first, and then find and fix the data problems:
    • Detect and fix errors in dirty_data.csv
    • Impute the missing values in missing_data.csv
    • Detect and remove anomalies
    • Check whether a customer is happy with their last order
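
    A minimal sketch (Python/pandas) of a first pass at these tasks, using the column names listed above; the specific checks, the expected season values, and the imputation strategy are illustrative assumptions, not the intended solution.

    ```python
    # Minimal sketch of a first EDA/cleaning pass; checks and imputation are illustrative.
    import pandas as pd

    dirty = pd.read_csv("dirty_data.csv")
    missing = pd.read_csv("missing_data.csv")

    # Detect errors in dirty_data.csv: e.g. dates that do not parse as YYYY-MM-DD,
    # or season values outside an assumed expected set.
    bad_dates = dirty[pd.to_datetime(dirty["date"], format="%Y-%m-%d", errors="coerce").isna()]
    bad_season = dirty[~dirty["season"].str.lower().isin(["spring", "summer", "autumn", "winter"])]

    # Impute missing numeric values in missing_data.csv (simple median imputation as a placeholder).
    for col in ["order_price", "delivery_charges", "order_total"]:
        missing[col] = missing[col].fillna(missing[col].median())

    # Flag anomalies: orders whose total deviates strongly from the bulk (simple z-score rule).
    z = (dirty["order_total"] - dirty["order_total"].mean()) / dirty["order_total"].std()
    anomalies = dirty[z.abs() > 3]

    print(len(bad_dates), len(bad_season), len(anomalies))
    ```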

    All the Best

  18. Standard

    • data.ct.gov
    csv, xlsx, xml
    Updated Dec 2, 2025
    Cite
    Department of Energy and Environmental Protection (2025). Standard [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/Standard/hvpp-uguz
    Available download formats: csv, xlsx, xml
    Dataset updated
    Dec 2, 2025
    Authors
    Department of Energy and Environmental Protection
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    PLEASE NOTE: We know there are errors in the data although we strive to minimize them. Examples include:
    • Manifests completed incorrectly by the generator or the transporter - data was entered based on the incorrect information. We can only enter the information we receive.
    • Data entry errors – we now have QA/QC procedures in place to prevent or catch and fix a lot of these.
    • Historically there are multiple records of the same generator. Each variation in spelling in name or address generated a separate handler record. We have worked to minimize these but many remain. The good news is that as long as they all have the same EPA ID they will all show up in your search results.
    • Handlers provide erroneous data to obtain an EPA ID - data entry was based on erroneous information. Examples include incorrect or bogus addresses and names. There are also a lot of MISSPELLED NAMES AND ADDRESSES!
    • Missing manifests – Not every required manifest gets submitted to the DEP. Also, of the more than 100,000 paper manifests we receive each year, some were incorrectly handled and never entered.
    • Missing data – we know that the records for approximately 25 boxes of manifests, mostly prior to 1985, were lost from the database in the 1980’s.
    • Translation errors – the data has been migrated to newer data platforms numerous times, and each time there have been errors and data losses.
    • Wastes incorrectly entered – mostly due to complex names that were difficult to spell, or typos in quantities or units of measure.

  19. Generator Summary View

    • data.wu.ac.at
    csv, json, xml
    Updated May 15, 2018
    + more versions
    Cite
    Department of Energy and Environmental Protection (2018). Generator Summary View [Dataset]. https://data.wu.ac.at/schema/data_ct_gov/NzJtaS0zZjgy
    Available download formats: json, csv, xml
    Dataset updated
    May 15, 2018
    Dataset provided by
    Department of Energy and Environmental Protection
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    PLEASE NOTE: Use ALL CAPS when searching using the "Filter" function on text such as: LITCHFIELD. But not needed for the upper right corner "Find in this Dataset" search where for example "Litchfield" can be used.
    We know there are errors in the data although we strive to minimize them. Examples include:
    • Manifests completed incorrectly by the generator or the transporter - data was entered based on the incorrect information. We can only enter the information we receive.
    • Data entry errors – we now have QA/QC procedures in place to prevent or catch and fix a lot of these.
    • Historically there are multiple records of the same generator. Each variation in spelling in name or address generated a separate handler record. We have worked to minimize these but many remain. The good news is that as long as they all have the same EPA ID they will all show up in your search results.
    • Handlers provide erroneous data to obtain an EPA ID - data entry was based on erroneous information. Examples include incorrect or bogus addresses and names. There are also a lot of MISSPELLED NAMES AND ADDRESSES!
    • Missing manifests – Not every required manifest gets submitted to the DEP. Also, of the more than 100,000 paper manifests we receive each year, some were incorrectly handled and never entered.
    • Missing data – we know that the records for approximately 25 boxes of manifests, mostly prior to 1985, were lost from the database in the 1980’s.
    • Translation errors – the data has been migrated to newer data platforms numerous times, and each time there have been errors and data losses.
    • Wastes incorrectly entered – mostly due to complex names that were difficult to spell, or typos in quantities or units of measure.

  20. Enhancing Proof Assistant Error Messages with Hints: A User Study

    • data.4tu.nl
    zip
    Updated Jun 18, 2025
    Cite
    Maria Khakimova; Jesper Cockx; Sára Juhošová; Jaro Reinders (2025). Enhancing Proof Assistant Error Messages with Hints: A User Study [Dataset]. http://doi.org/10.4121/79e7c4eb-81dc-492a-9ac4-69f33166de8e.v1
    Available download formats: zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Maria Khakimova; Jesper Cockx; Sára Juhošová; Jaro Reinders
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repository contains the user study data accompanying the master thesis by Maria Khakimova titled "Enhancing Proof Assistant Error Messages with Hints: A User Study". The goal of the research was to investigate the impacts of hint-based error message enhancements in Agda on novice programmers. To do this, we enhanced three error messages with hints, and conducted a user study.


    In the user study, we asked participants to resolve errors in pre-written Agda code, and rate the helpfulness of the error message. We collected the following data:

    • code compilation status (success/fail),
    • compilation timestamps, and
    • responses to the "Did you find the error message helpful?" question (on a Likert scale).


    This repository contains the programming questions created for the user study, with the accompanying error messages (both original and enhanced) in programming_exercises.zip. We also provide the (anonymised) collected data in JSON format in response-data.json.
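
    As a starting point, a minimal sketch (Python) of loading response-data.json and tallying the Likert responses. The field name "helpfulness" and the assumption that the file holds a list of response records are hypothetical; the actual JSON schema is documented in the repository README.

    ```python
    # Minimal sketch of tallying the Likert ratings in response-data.json.
    # The field name "helpfulness" is hypothetical; see the dataset README for the actual schema.
    import json
    from collections import Counter

    with open("response-data.json", encoding="utf-8") as f:
        responses = json.load(f)          # assumed: a list of per-question response records

    ratings = Counter(r.get("helpfulness") for r in responses)
    for rating, count in sorted(ratings.items(), key=str):
        print(rating, count)
    ```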


    For more details, please read the provided README.
