100+ datasets found
  1. Data from: Is repairing speech errors an automatic or a controlled process? Insights from the relationship between error and repair probabilities in English and Spanish

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Sep 13, 2019
    Cite
    McCloskey, Nicholas; Martin, Clara D.; Nozari, Nazbanou (2019). Is repairing speech errors an automatic or a controlled process? Insights from the relationship between error and repair probabilities in English and Spanish [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000097521
    Dataset updated
    Sep 13, 2019
    Authors
    McCloskey, Nicholas; Martin, Clara D.; Nozari, Nazbanou
    Description

    Speakers can correct their speech errors, but the mechanisms behind repairs are still unclear. Some findings, such as the speed of repairs and speakers’ occasional unawareness of them, point to an automatic repair process. This paper reports a finding that challenges a purely automatic repair process. Specifically, we show that as error rate increases, so does the proportion of repairs. Twenty highly-proficient English-Spanish bilinguals described dynamic visual events in real time (e.g. “The blue bottle disappears behind the brown curtain”) in English and Spanish blocks. Both error rates and proportion of corrected errors were higher on (a) noun phrase (NP)2 vs. NP1, and (b) word1 (adjective in English and noun in Spanish) vs. word2 within the NP. These results show a consistent relationship between error and repair probabilities, disentangled from position, compatible with a model in which greater control is recruited in error-prone situations to enhance the effectiveness of repair.

  2. Replication data for: How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It

    • dataverse.harvard.edu
    Updated Jan 11, 2023
    Cite
    Gary King; Margaret Roberts (2023). Replication data for: How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It [Dataset]. http://doi.org/10.7910/DVN/26935
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gary King; Margaret Roberts
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    "Robust standard errors" are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, w ill still bias estimators of all but a few quantities of interest. Even though this message is well known to methodologists, it has failed to reach most applied researchers. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help applied researchers realize these gains via an alternative perspective that offers a productive way to use robust standard errors; a new general and easier-to-use "generalized information matrix test" statistic; and practical illustrations via simulations and real examples from published research. Instead of jettisoning this extremely popular tool, as some suggest, we show how robust and classical standard error differences can provide effective clues about model misspecification, likely biases, and a guide to more reliable inferences. See also: Unifying Statistical Analysis

  3. Semi-supervised data cleaning

    • resodate.org
    Updated Dec 4, 2020
    Cite
    Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Mohammad Mahdavi Lahijani
    Description

    Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values. The goal of error correction is to fix these wrong values. Data cleaning is a challenging task due to the trade-off among correctness, completeness, and automation. In fact, detecting/correcting all data errors accurately without any user involvement is not possible for every dataset. We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner. In fact, the approach iteratively asks the user to annotate a tuple, i.e., marking/fixing a few data errors. The approach learns to generalize the user-provided error detection/correction examples to the rest of the dataset, accordingly. Our novel two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters. In fact, the approach considers the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task as its first and second steps maximize recall and precision, respectively. Third, the approach also minimizes human involvement as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation of our approach allows us to leverage previous data cleaning efforts to optimize the current data cleaning task. We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, our pipeline consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.
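
    The two-step formulation described above can be illustrated with a small, generic sketch (Python/scikit-learn). This is not the Raha or Baran implementation; the base detectors, features, and sampling below are simplified stand-ins meant only to show how the outputs of independent base detectors plus a few user labels can be combined into final error predictions.

    ```python
    # Conceptual sketch of ensembling base error detectors with a few user labels.
    # Not the Raha/Baran code -- detectors and features are simplified stand-ins.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({"zip": ["10115", "1O115", "10969", "abcde"],
                       "city": ["Berlin", "Berlin", "berlin", "Berlin"]})

    # Step 1: each base detector independently marks rows as potentially erroneous (1) or clean (0).
    def detector_non_numeric_zip(frame):
        return (~frame["zip"].str.isdigit()).astype(int)

    def detector_lowercase_city(frame):
        return frame["city"].str.islower().astype(int)

    base_outputs = np.column_stack([detector_non_numeric_zip(df), detector_lowercase_city(df)])

    # Step 2: a few user-annotated tuples (1 = contains an error) are generalized to the rest.
    labeled_idx = [0, 1]       # tuples the user was asked to annotate
    user_labels = [0, 1]       # user marks tuple 0 as clean, tuple 1 as erroneous

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(base_outputs[labeled_idx], user_labels)
    predicted_errors = clf.predict(base_outputs)   # final ensembled error predictions
    print(predicted_errors)
    ```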

  4. Data from: Performance and accuracy of lightweight and low-cost GPS data loggers according to antenna positions, fix intervals, habitats and animal movements

    • search.dataone.org
    • datasetcatalog.nlm.nih.gov
    • +3more
    Updated Jun 27, 2025
    Cite
    Marie-Amélie Forin-Wiart; Pauline Hubert; Pascal Sirguey; Marie-Lazarine Poulle (2025). Performance and accuracy of lightweight and low-cost GPS data loggers according to antenna positions, fix intervals, habitats and animal movements [Dataset]. http://doi.org/10.5061/dryad.7nm7b
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marie-Amélie Forin-Wiart; Pauline Hubert; Pascal Sirguey; Marie-Lazarine Poulle
    Time period covered
    Jan 1, 2016
    Description

    Recently developed low-cost Global Positioning System (GPS) data loggers are promising tools for wildlife research because of their affordability for low-budget projects and ability to simultaneously track a greater number of individuals compared with expensive built-in wildlife GPS. However, the reliability of these devices must be carefully examined because they were not developed to track wildlife. This study aimed to assess the performance and accuracy of commercially available GPS data loggers for the first time using the same methods applied to test built-in wildlife GPS. The effects of antenna position, fix interval and habitat on the fix-success rate (FSR) and location error (LE) of CatLog data loggers were investigated in stationary tests, whereas the effects of animal movements on these errors were investigated in motion tests. The units operated well and presented consistent performance and accuracy over time in stationary tests, and the FSR was good for all antenna positions...

  5. Sberbank Russian Housing Market Data Fix

    • kaggle.com
    zip
    Updated May 7, 2017
    Cite
    Matthew Anderson (2017). Sberbank Russian Housing Market Data Fix [Dataset]. https://www.kaggle.com/matthewa313/sberbankdatafix
    Available download formats: zip (17180267 bytes)
    Dataset updated
    May 7, 2017
    Authors
    Matthew Anderson
    Area covered
    Russia
    Description

    Upon reviewing the train data for the Sberbank Russian Housing Market competition, I noticed noise & errors. Obviously, neither of these should be present in your training set, and as such, you should remove them. This is the updated train set with all noise & errors I found removed.

    Data was removed when:

    - full_sq - life_sq < 0
    - full_sq - kitch_sq < 0
    - life_sq - kitch_sq < 0
    - floor - max_floor < 0

    I simply deleted those rows from the dataset; nothing else was changed.
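
    A minimal sketch (Python/pandas) of the row-removal rule described above, assuming the original train.csv from the competition; the column names follow the conditions listed, and the file paths are illustrative.

    ```python
    # Minimal sketch of the described cleaning rule: drop rows with impossible area/floor values.
    # Assumes the original Sberbank train.csv; file paths are illustrative.
    import pandas as pd

    train = pd.read_csv("train.csv")

    bad = (
        (train["full_sq"] - train["life_sq"] < 0)
        | (train["full_sq"] - train["kitch_sq"] < 0)
        | (train["life_sq"] - train["kitch_sq"] < 0)
        | (train["floor"] - train["max_floor"] < 0)
    )
    clean = train.loc[~bad]           # rows violating any condition are simply deleted
    clean.to_csv("train_fixed.csv", index=False)
    ```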

  6. Data from: autorepair

    • huggingface.co
    Updated Aug 11, 2024
    Cite
    Automatic Program Comprehension Lab (2024). autorepair [Dataset]. https://huggingface.co/datasets/apcl/autorepair
    Dataset updated
    Aug 11, 2024
    Dataset authored and provided by
    Automatic Program Comprehension Lab
    Description

    A Lossless Syntax Tree Generator with Zero-shot Error Correction

    This repository includes all of the datasets to reproduce the results in the paper and the srcml files that we generated. We follow Jam's procedure to compile the dataset for pretraining and finetuning.

    Dataset files

    Filename: Description
    bin.tar.gz: bin files to finetune the model to fix the syntactic error
    fundats.tar.gz: data files to generate srcml with the error correction in the zero-shot… See the full description on the dataset page: https://huggingface.co/datasets/apcl/autorepair.

  7. Comparison between fix success rate (FSR) ± standard deviation and root mean...

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Mariano R. Recio; Renaud Mathieu; Paul Denys; Pascal Sirguey; Philip J. Seddon (2023). Comparison between fix success rate (FSR) ± standard deviation and root mean square of location errors (LERMS), mean location errors (µLE) ± standard deviation and median (µ1/2LE) obtained from analysis of data collected at stationary tests (N = 60) under different habitats, vegetation configuration and sky availability. [Dataset]. http://doi.org/10.1371/journal.pone.0028225.t002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mariano R. Recio; Renaud Mathieu; Paul Denys; Pascal Sirguey; Philip J. Seddon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outliers correspond to fixes with location error (LE) > 3 standard deviations from the mean location error of all fixes in the same habitat (i.e., without regard to the visibility category). The last two columns report on the mean number of outliers ± standard deviation across each visibility, and LERMS values calculated from all fixes in the same habitat after removal of outlier values.
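
    A minimal sketch (Python/pandas) of the outlier rule stated above: flag fixes whose location error exceeds the habitat mean by more than three standard deviations, then recompute LERMS per habitat without them. The column and file names ("habitat", "LE", "stationary_fixes.csv") are assumptions for illustration.

    ```python
    # Minimal sketch of the stated outlier rule: LE more than 3 SD above the habitat mean.
    # Column names "habitat" and "LE" and the file name are assumed for illustration.
    import pandas as pd

    fixes = pd.read_csv("stationary_fixes.csv")

    grouped = fixes.groupby("habitat")["LE"]
    threshold = grouped.transform("mean") + 3 * grouped.transform("std")
    fixes["is_outlier"] = fixes["LE"] > threshold

    # LE root mean square per habitat after outlier removal
    lerms = (fixes.loc[~fixes["is_outlier"]]
                  .groupby("habitat")["LE"]
                  .apply(lambda le: (le ** 2).mean() ** 0.5))
    print(lerms)
    ```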

  8. Fix: Employee Payroll Data (FMPS Payroll Costing) - 7/10/2025

    • catalog.data.gov
    Updated Jul 12, 2025
    Cite
    data.cityofchicago.org (2025). Fix: Employee Payroll Data (FMPS Payroll Costing) - 7/10/2025 [Dataset]. https://catalog.data.gov/dataset/fix-employee-payroll-data-fmps-payroll-costing-7-10-2025
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    data.cityofchicago.org
    Description

    Reload to correct some errors.

  9. Results from stationary unit tests performed with 40 low-cost CatLog GPS...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 18, 2015
    Cite
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline (2015). Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected from for two antenna positions, three fix intervals programs and four habitat types. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001919478
    Dataset updated
    Jun 18, 2015
    Authors
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline
    Description

    Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected for two antenna positions, three fix interval programs and four habitat types.

  10. 200 Annotated Developer Human Errors from GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 4, 2024
    Cite
    Meyers, Benjamin; Meneely, Andrew (2024). 200 Annotated Developer Human Errors from GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10080448
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Rochester Institute of Technology
    Authors
    Meyers, Benjamin; Meneely, Andrew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineers' Human Errors

    This dataset contains 200 GitHub comments with manual human error annotations, released as part of the following publication:

    Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.

    Included Files

    The "developer_human_errors.csv" file contains the full dataset of 200 software defect descriptions annotated with human error types (slips, lapses, mistakes) and T.H.E.S.E. categories.

    CSV Fields

    ID: Unique identifier for the comment.

    SOURCE: Whether this comment originates from a commit, issue, or pull request.

    COMMENT_URL: The URL linking to the comment.

    COMMENT_TEXT: The raw comment text.

    HUMAN_ERROR_TYPE: Whether the software defect described is a slip, lapse, or mistake.

    THESE_V4_ID: Manually assigned T.H.E.S.E. category with labels corresponding to Version 4 of T.H.E.S.E.

    THESE_NAME: Name corresponding to manually assigned T.H.E.S.E. category.
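
    For orientation, a minimal sketch (Python/pandas) of loading the CSV and tallying the annotations by the fields listed above; only the filename and column names given in this description are assumed.

    ```python
    # Minimal sketch: load the annotations and tally them by error type and T.H.E.S.E. category.
    import pandas as pd

    errors = pd.read_csv("developer_human_errors.csv")

    # Distribution of slips / lapses / mistakes
    print(errors["HUMAN_ERROR_TYPE"].value_counts())

    # Distribution of T.H.E.S.E. categories, with their names
    print(errors.groupby(["THESE_V4_ID", "THESE_NAME"]).size().sort_values(ascending=False))
    ```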

    Annotation Details

    Human error types span slips, lapses, and mistakes from James Reason's Generic Error Modelling System (GEMS):

    Slips: Failures of attention.

    Lapses: Failures of memory.

    Mistakes: Failures of planning.

    T.H.E.S.E. categories are summarized below:

    S01: Typos & Misspellings

    S02: Syntax Errors

    S03: Overlooking Documented Information

    S04: Multitasking Errors

    S05: Hardware Interaction Errors

    S06: Overlooking Proposed Code Changes

    S07: Overlooking Existing Functionality

    S08: General Attentional Failure

    L01: Forgetting to Finish a Development Task

    L02: Forgetting to Fix a Defect

    L03: Forgetting to Remove Development Artifacts

    L04: Working with Outdated Source Code

    L05: Forgetting an Import Statement

    L06: Forgetting to Save Work

    L07: Forgetting Previous Development Discussion

    L08: General Memory Failure

    M01: Code Logic Errors

    M02: Incomplete Domain Knowledge

    M03: Wrong Assumption Errors

    M04: Internal Communication Errors

    M05: External Communication Errors

    M06: Solution Choice Errors

    M07: Time Management Errors

    M08: Inadequate Testing

    M09: Incorrect/Insufficient Configuration

    M10: Code Complexity Errors

    M11: Internationalization/String Encoding Errors

    M12: Inadequate Experience Errors

    M13: Insufficient Tooling Access Errors

    M14: Workflow Order Errors

    M15: General Planning Failure

    Contact

    Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

    Acknowledgments

    Collection of this data has been sponsored in part by the National Science Foundation (grant 1922169), by the NSA Science of Security Lablet program (grant H98230-17-D-0080/2018-0438-02), and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).

  11. British Job Agency Employment

    • kaggle.com
    zip
    Updated Jul 27, 2018
    Cite
    Rahul (2018). British Job Agency Employment [Dataset]. https://www.kaggle.com/rahul025/error-detection
    Available download formats: zip (963570 bytes)
    Dataset updated
    Jul 27, 2018
    Authors
    Rahul
    Area covered
    United Kingdom
    Description

    Auditing and Cleansing the Job dataset

    The dataset description is shown below:

    Columns and its Description

    Id : 8 digit Id of the job advertisement,

    Title: Title of the advertised job position,

    Location: Location of the advertised job position,

    ContractType: The contract type of the advertised job position, could be full-time, part-time or non-specified,

    ContractTime: The contract time of the advertised job position, could be permanent, contract or non-specified,

    Company: Company (employer) of the advertised job position,

    Category: The Category of the advertised job position, e.g., IT jobs, Engineering Jobs, etc.

    Salary per annum: Annual Salary of the advertised job position, e.g., 80000,

    OpenDate: The opening time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012,

    CloseDate: The closing time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012,

    SourceName: The website where the job position is advertised.

    In this task, you are required to inspect and audit the data (dataset1_with_error.csv) to identify the data problems, and then fix them. The generic and major data problems that might be found in the data include:

    - Lexical errors
    - Irregularities
    - Violations of integrity constraints
    - Inconsistency

    In the end, save the error-free dataset as dataset1_solution.csv. The number of records in your solution should be the same as the number of records in the input file.
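
    A minimal sketch (Python/pandas) of how such an audit could start, using the columns described above. The specific checks shown (ID format, date parsing, open-before-close constraint) are illustrative, not a complete solution.

    ```python
    # Minimal sketch of auditing dataset1_with_error.csv; checks are illustrative, not exhaustive.
    import pandas as pd

    jobs = pd.read_csv("dataset1_with_error.csv", dtype=str)

    # Lexical errors: Id should be exactly 8 digits.
    bad_ids = jobs[~jobs["Id"].str.fullmatch(r"\d{8}", na=False)]

    # Irregularities: dates should parse with the documented format, e.g. 20120104T150000.
    open_dt = pd.to_datetime(jobs["OpenDate"], format="%Y%m%dT%H%M%S", errors="coerce")
    close_dt = pd.to_datetime(jobs["CloseDate"], format="%Y%m%dT%H%M%S", errors="coerce")
    bad_dates = jobs[open_dt.isna() | close_dt.isna()]

    # Integrity constraint: an advertisement must open before it closes.
    violations = jobs[open_dt >= close_dt]

    print(len(bad_ids), len(bad_dates), len(violations))
    # After fixing the problems, write the result with the same number of records:
    # jobs.to_csv("dataset1_solution.csv", index=False)
    ```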

  12. 10 Years Bug-Fix Dataset (PROMISE'19)

    • figshare.com
    • search.datacite.org
    zip
    Updated Sep 27, 2021
    Cite
    Renan Vieira (2021). 10 Years Bug-Fix Dataset (PROMISE'19) [Dataset]. http://doi.org/10.6084/m9.figshare.8852084.v5
    Available download formats: zip
    Dataset updated
    Sep 27, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Renan Vieira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package of the paper "From Reports to Bug-Fix Commits: A 10 Years Dataset of Bug-Fixing Activity from 55 Apache's Open Source Projects"

    ABSTRACT: Bugs appear in almost any software development. Solving all or at least a large part of them requires a great deal of time, effort, and budget. Software projects typically use issue tracking systems as a way to report and monitor bug-fixing tasks. In recent years, several researchers have been conducting bug tracking analysis to better understand the problem and thus provide means to reduce costs and improve the efficiency of the bug-fixing task. In this paper, we introduce a new dataset composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation, distributed in 9 categories. We have mined this information from the Jira issue tracking system concerning two different perspectives of reports with closed/resolved status: static (the latest version of reports) and dynamic (the changes that have occurred in reports over time). We also extract information from the commits (if they exist) that fix such bugs from their respective version-control system (Git). We also provide a change analysis that occurs in the reports as a way of illustrating and characterizing the proposed dataset. Since the data extraction process is an error-prone, nontrivial task, we believe initiatives like this could be useful to support researchers in further, more detailed investigations.

    You can find the full paper at: https://doi.org/10.1145/3345629.3345639

    If you use this dataset for your research, please reference the following paper:

    @inproceedings{Vieira:2019:RBC:3345629.3345639,
      author = {Vieira, Renan and da Silva, Ant\^{o}nio and Rocha, Lincoln and Gomes, Jo\~{a}o Paulo},
      title = {From Reports to Bug-Fix Commits: A 10 Years Dataset of Bug-Fixing Activity from 55 Apache's Open Source Projects},
      booktitle = {Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering},
      series = {PROMISE'19},
      year = {2019},
      isbn = {978-1-4503-7233-6},
      location = {Recife, Brazil},
      pages = {80--89},
      numpages = {10},
      url = {http://doi.acm.org/10.1145/3345629.3345639},
      doi = {10.1145/3345629.3345639},
      acmid = {3345639},
      publisher = {ACM},
      address = {New York, NY, USA},
      keywords = {Bug-Fix Dataset, Mining Software Repositories, Software Traceability},
    }

    P.S.: We added a new dataset version (v1.0.1). In this version, we fixed the git commit features that track the src and test files. More info can be found in the fix-script.py file.

  13. Define Best Tariff for a Telecom Company

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Roman Nikiforov (2024). Define Best Tariff for a Telecom Company [Dataset]. https://www.kaggle.com/datasets/romanniki/prospective-tariff-for-a-telecom-company
    Available download formats: zip (3456315 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Roman Nikiforov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Determining the Prospective Tariff for a Telecom Company

    Project Description

    You are an analyst at "Megaline," a federal mobile operator. The company offers two tariff plans to customers: "Smart" and "Ultra." To adjust the advertising budget, the commercial department wants to understand which tariff generates more revenue.

    You need to conduct a preliminary analysis of the tariffs on a small sample of customers. You have data on 500 users of "Megaline": who they are, where they are from, which tariff they use, how many calls and messages they sent in 2018. You need to analyze customer behavior and conclude which tariff is better.

    Tariff Descriptions

    "Smart" Tariff: - Monthly fee: 550 rubles - Included: 500 minutes of calls, 50 messages, and 15 GB of internet traffic - Cost of services beyond the tariff package: 1. Call minute: 3 rubles (Megaline always rounds up minutes and megabytes. If the user talked for just 1 second, it counts as a whole minute); 2. Message: 3 rubles; 3. 1 GB of internet traffic: 200 rubles.

    "Ultra" Tariff: - Monthly fee: 1950 rubles - Included: 3000 minutes of calls, 1000 messages, and 30 GB of internet traffic - Cost of services beyond the tariff package: 1. Call minute: 1 ruble; 2. Message: 1 ruble; 3. 1 GB of internet traffic: 150 rubles.

    Note: Megaline always rounds up seconds to minutes and megabytes to gigabytes. Each call is rounded up individually: even if it lasted just 1 second, it is counted as 1 minute. For web traffic, separate sessions are not counted. Instead, the total amount for the month is rounded up. If a subscriber uses 1025 megabytes in a month, they are charged for 2 gigabytes.

    Project Steps

    Step 1: Open the files with data and study the general information. File paths:
    - /datasets/calls.csv
    - /datasets/internet.csv
    - /datasets/messages.csv
    - /datasets/tariffs.csv
    - /datasets/users.csv

    Step 2: Prepare the data
    - Convert data to the required types;
    - Find and fix errors in the data, if any. Explain what errors you found and how you fixed them. You will find calls with zero duration in the data. This is not an error: missed calls are indicated by zeros, so they do not need to be deleted.

    For each user, calculate:
    - Number of calls made and minutes spent per month;
    - Number of messages sent per month;
    - Amount of internet traffic used per month;
    - Monthly revenue from each user (subtract the free limit from the total number of calls, messages, and internet traffic; multiply the remainder by the value from the tariff plan; add the corresponding tariff plan's subscription fee), as sketched below.
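
    A minimal sketch (Python) of the monthly revenue rule, using the tariff parameters and rounding rules given in this description; the function, constants, and example values are illustrative and not part of the dataset.

    ```python
    # Minimal sketch of the monthly revenue rule described above, using the tariff
    # parameters from the project description; per-user aggregation is assumed done already.
    import math

    TARIFFS = {
        "smart": {"fee": 550, "minutes": 500, "messages": 50, "gb": 15,
                  "rub_per_min": 3, "rub_per_msg": 3, "rub_per_gb": 200},
        "ultra": {"fee": 1950, "minutes": 3000, "messages": 1000, "gb": 30,
                  "rub_per_min": 1, "rub_per_msg": 1, "rub_per_gb": 150},
    }

    def monthly_revenue(tariff_name, call_durations, messages, mb_used):
        """call_durations: per-call durations in minutes (each call rounded up individually);
        mb_used: total monthly traffic in MB (monthly total rounded up to whole GB)."""
        t = TARIFFS[tariff_name]
        minutes = sum(math.ceil(d) for d in call_durations)   # each call rounded up separately
        gb = math.ceil(mb_used / 1024)                        # monthly total rounded up to GB
        extra = (max(0, minutes - t["minutes"]) * t["rub_per_min"]
                 + max(0, messages - t["messages"]) * t["rub_per_msg"]
                 + max(0, gb - t["gb"]) * t["rub_per_gb"])
        return t["fee"] + extra

    # A 0.02-minute call counts as a whole minute; 1025 MB in a month is charged as 2 GB.
    print(monthly_revenue("smart", [0.02, 10.5], 60, 1025))
    ```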

    Step 3: Analyze the data. Describe the behavior of the operator's customers based on the sample. How many minutes of calls, how many messages, and how much internet traffic do users of each tariff need per month? Calculate the average, variance, and standard deviation. Create histograms. Describe the distributions.

    Step 4: Test hypotheses
    - The average revenue of users of the "Ultra" and "Smart" tariffs is different;
    - The average revenue of users from Moscow differs from the revenue of users from other regions. Moscow is written as 'Москва'; use this value when checking the hypothesis.

    Set the threshold alpha value yourself.

    Explain:
    - How you formulated the null and alternative hypotheses;
    - Which criterion you used to test the hypotheses and why.
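
    A minimal sketch (Python/scipy) of how the Step 4 comparison could be set up with a two-sample t-test, assuming per-user monthly revenue has already been computed in Step 2; the revenue values, test choice, and alpha below are illustrative only.

    ```python
    # Minimal sketch of testing "average revenue differs between Ultra and Smart users".
    # Assumes per-user monthly revenues were computed in Step 2; values here are placeholders.
    from scipy import stats

    smart_revenue = [550, 580, 610, 550, 760]       # placeholder per-user revenues, rubles
    ultra_revenue = [1950, 1950, 2100, 1950, 2250]

    alpha = 0.05                                     # threshold chosen by the analyst
    # H0: mean revenues are equal; H1: they differ. equal_var=False -> Welch's t-test.
    t_stat, p_value = stats.ttest_ind(smart_revenue, ultra_revenue, equal_var=False)

    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    print("reject H0" if p_value < alpha else "fail to reject H0")
    ```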

    Step 5: Write a general conclusion

    Formatting: Perform the task in a Jupyter Notebook. Put program code in code cells and textual explanations in markdown cells. Apply formatting and headers.

    Data Description

    Table users (user information):
    - user_id: unique user identifier
    - first_name: user's first name
    - last_name: user's last name
    - age: user's age (years)
    - reg_date: date of tariff connection (day, month, year)
    - churn_date: date of tariff discontinuation (if the value is missing, the tariff was still active at the time of data extraction)
    - city: user's city of residence
    - tariff: name of the tariff plan

    Table calls (call information):
    - id: unique call number
    - call_date: call date
    - duration: call duration in minutes
    - user_id: identifier of the user who made the call

    Table messages (message information):
    - id: unique message number
    - message_date: message date
    - user_id: identifier of the user who sent the message

    Table internet (internet session information):
    - id: unique session number
    - mb_used: amount of internet traffic used during the session (in megabytes)
    - session_date: internet session date
    - user_id: user identifier

    Table tariffs (tariff information):
    - tariff_name: tariff name
    - rub_monthly_fee: monthly subscription fee in rubles
    - minutes_included: number of call minutes included per month
    - `messages_included...

  14. Data from: Development of Crime Forecasting and Mapping Systems for Use by Police in Pittsburgh, Pennsylvania, and Rochester, New York, 1990-2001

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Nov 14, 2025
    Cite
    National Institute of Justice (2025). Development of Crime Forecasting and Mapping Systems for Use by Police in Pittsburgh, Pennsylvania, and Rochester, New York, 1990-2001 [Dataset]. https://catalog.data.gov/dataset/development-of-crime-forecasting-and-mapping-systems-for-use-by-police-in-pittsburgh-1990--09e19
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    Rochester, Pittsburgh, Pennsylvania
    Description

    This study was designed to develop crime forecasting as an application area for police in support of tactical deployment of resources. Data on crime offense reports and computer aided dispatch (CAD) drug calls and shots fired calls were collected from the Pittsburgh, Pennsylvania Bureau of Police for the years 1990 through 2001. Data on crime offense reports were collected from the Rochester, New York Police Department from January 1991 through December 2001. The Rochester CAD drug calls and shots fired calls were collected from January 1993 through May 2001. A total of 1,643,828 records (769,293 crime offense and 874,535 CAD) were collected from Pittsburgh, while 538,893 records (530,050 crime offense and 8,843 CAD) were collected from Rochester. ArcView 3.3 and GDT Dynamap 2000 Street centerline maps were used to address match the data, with some of the Pittsburgh data being cleaned to fix obvious errors and increase address match percentages. A SAS program was used to eliminate duplicate CAD calls based on time and location of the calls. For the 1990 through 1999 Pittsburgh crime offense data, the address match rate was 91 percent. The match rate for the 2000 through 2001 Pittsburgh crime offense data was 72 percent. The Pittsburgh CAD data address match rate for 1990 through 1999 was 85 percent, while for 2000 through 2001 the match rate was 100 percent because the new CAD system supplied incident coordinates. The address match rates for the Rochester crime offenses data was 96 percent, and 95 percent for the CAD data. Spatial overlay in ArcView was used to add geographic area identifiers for each data point: precinct, car beat, car beat plus, and 1990 Census tract. The crimes included for both Pittsburgh and Rochester were aggravated assault, arson, burglary, criminal mischief, misconduct, family violence, gambling, larceny, liquor law violations, motor vehicle theft, murder/manslaughter, prostitution, public drunkenness, rape, robbery, simple assaults, trespassing, vandalism, weapons, CAD drugs, and CAD shots fired.

  15. 🌍World Cities Population - cleaned version 🌍

    • kaggle.com
    zip
    Updated Oct 12, 2022
    Cite
    Donato Riccio (2022). 🌍World Cities Population - cleaned version 🌍 [Dataset]. https://www.kaggle.com/datasets/donatoriccio/world-cities-population-cleaned-version
    Available download formats: zip (7645992 bytes)
    Dataset updated
    Oct 12, 2022
    Authors
    Donato Riccio
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    About Dataset

    All cities with a population or seat of an administrative division (ca. 80,000).

    Sources and Contributions
    - Sources: GeoNames is aggregating over a hundred different data sources.
    - Ambassadors: GeoNames Ambassadors help in many countries.
    - Wiki: A wiki allows you to view the data, quickly fix errors, and add missing places.
    - Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: added country name.

    Context

    Columns: Name, Country Code, Country Name, Timezone, Population, Latitude, Longitude

    Acknowledgments
    These data come from Maxmind.com and have not been altered. The original source can be found by clicking here.

    Additionally:
    Reference: https://download.geonames.org/export/dump/
    Attributions: https://www.geonames.org/about.html

  16. Study Interventions -- See ALERT below

    • catalog.data.gov
    Updated Nov 2, 2025
    Cite
    National Center for PTSD (2025). Study Interventions -- See ALERT below [Dataset]. https://catalog.data.gov/dataset/study-interventions-78076
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    National Center for PTSD
    Description

    ALERT: As of 10/15/2025, we are working to resolve a data error in treatment completion variables (percent and detail). We expect a resolution by 10/31/2025, at which point downloading the revised data is advised.

    The Study Interventions dataset includes information about each of the specific treatment arms that were studied in all RCTs. Each study arm was coded to indicate the type of intervention or comparison condition. This dataset includes the study-level Study Class as well as individual variables for each category of treatment, coded as Yes or No for each arm. Study arm treatment category variables are as follows: Pharmacotherapy (as well as a subclass such as antidepressant, antianxiety, etc.); Psychotherapy (as well as a subclass to identify trauma-focused or non-trauma-focused therapy); Complementary and Integrative Health (CIH; as well as a subclass such as relaxation or meditation); Nonpharmacologic Biological; Nonpharmacologic Cognitive; Collaborative Care; Other Treatments; Control.

    The Study Intervention dataset also includes information on the format of the treatment (individual, group, couples, mixed); treatment delivery method (in person, by phone, by video, technology alone, technology assisted, written, or mixed); dose or amount of treatment; and treatment completion and adherence. Use this dataset to learn about treatment studies of a particular type. Each record is an arm of the study, labeled as A, B, C, or D. Values abstracted as not applicable ("NA") or not reported ("NR") from the study are null values (empty cells).

  17. Transactional Retail Dataset of Electronics Store

    • kaggle.com
    zip
    Updated Jul 20, 2021
    Cite
    Shahrayar (2021). Transactional Retail Dataset of Electronics Store [Dataset]. https://www.kaggle.com/muhammadshahrayar/transactional-retail-dataset-of-electronics-store
    Available download formats: zip (100952 bytes)
    Dataset updated
    Jul 20, 2021
    Authors
    Shahrayar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains information about an online electronic store. The store has three warehouses from which goods are delivered to customers.

    Columns Description

    • order_id: A unique id for each order
    • customer_id: A unique id for each customer
    • date: The date the order was made, given in YYYY-MM-DD format
    • nearest_warehouse: A string denoting the name of the nearest warehouse to the customer
    • shopping_cart: A list of tuples representing the order items: the first element of the tuple is the item ordered, and the second element is the quantity ordered for such item.
    • order_price: A float denoting the order price in USD. The order price is the price of items before any discounts and/or delivery charges are applied.
    • delivery_charges: A float representing the delivery charges of the order
    • customer_lat: Latitude of the customer’s location
    • customer_long: Longitude of the customer’s location
    • coupon_discount: An integer denoting the percentage discount to be applied to the order_price.
    • order_total: A float denoting the total of the order in USD after all discounts and/or delivery charges are applied.
    • season: A string denoting the season in which the order was placed.
    • is_expedited_delivery: A boolean denoting whether the customer has requested an expedited delivery
    • distance_to_nearest_warehouse: A float representing the arc distance, in kilometres, between the customer and the nearest warehouse to him/her.
    • latest_customer_review: A string representing the latest customer review on his/her most recent order
    • is_happy_customer: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.

    Inspiration

    Use this dataset to perform graphical and/or non-graphical EDA methods to understand the data first, and then find and fix the data problems:
    • Detect and fix errors in dirty_data.csv
    • Impute the missing values in missing_data.csv
    • Detect and remove anomalies
    • Check whether a customer is happy with their last order
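
    A minimal sketch (Python/pandas) of a first pass at these tasks, using the column names listed above; the specific checks, the expected season values, and the imputation strategy are illustrative assumptions, not the intended solution.

    ```python
    # Minimal sketch of a first EDA/cleaning pass; checks and imputation are illustrative.
    import pandas as pd

    dirty = pd.read_csv("dirty_data.csv")
    missing = pd.read_csv("missing_data.csv")

    # Detect errors in dirty_data.csv: e.g. dates that do not parse as YYYY-MM-DD,
    # or season values outside an assumed expected set.
    bad_dates = dirty[pd.to_datetime(dirty["date"], format="%Y-%m-%d", errors="coerce").isna()]
    bad_season = dirty[~dirty["season"].str.lower().isin(["spring", "summer", "autumn", "winter"])]

    # Impute missing numeric values in missing_data.csv (simple median imputation as a placeholder).
    for col in ["order_price", "delivery_charges", "order_total"]:
        missing[col] = missing[col].fillna(missing[col].median())

    # Flag anomalies: orders whose total deviates strongly from the bulk (simple z-score rule).
    z = (dirty["order_total"] - dirty["order_total"].mean()) / dirty["order_total"].std()
    anomalies = dirty[z.abs() > 3]

    print(len(bad_dates), len(bad_season), len(anomalies))
    ```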

    All the Best

  18. Standard

    • data.ct.gov
    csv, xlsx, xml
    Updated Dec 2, 2025
    Cite
    Department of Energy and Environmental Protection (2025). Standard [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/Standard/hvpp-uguz
    Available download formats: csv, xlsx, xml
    Dataset updated
    Dec 2, 2025
    Authors
    Department of Energy and Environmental Protection
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    PLEASE NOTE: We know there are errors in the data although we strive to minimize them. Examples include:
    • Manifests completed incorrectly by the generator or the transporter - data was entered based on the incorrect information. We can only enter the information we receive.
    • Data entry errors – we now have QA/QC procedures in place to prevent or catch and fix a lot of these.
    • Historically there are multiple records of the same generator. Each variation in spelling in name or address generated a separate handler record. We have worked to minimize these but many remain. The good news is that as long as they all have the same EPA ID they will all show up in your search results.
    • Handlers provide erroneous data to obtain an EPA ID - data entry was based on erroneous information. Examples include incorrect or bogus addresses and names. There are also a lot of MISSPELLED NAMES AND ADDRESSES!
    • Missing manifests – Not every required manifest gets submitted to the DEP. Also, of the more than 100,000 paper manifests we receive each year, some were incorrectly handled and never entered.
    • Missing data – we know that the records for approximately 25 boxes of manifests, mostly prior to 1985, were lost from the database in the 1980’s.
    • Translation errors – the data has been migrated to newer data platforms numerous times, and each time there have been errors and data losses.
    • Wastes incorrectly entered – mostly due to complex names that were difficult to spell, or typos in quantities or units of measure.

  19. Generator Summary View

    • data.wu.ac.at
    csv, json, xml
    Updated May 15, 2018
    + more versions
    Cite
    Department of Energy and Environmental Protection (2018). Generator Summary View [Dataset]. https://data.wu.ac.at/schema/data_ct_gov/NzJtaS0zZjgy
    Available download formats: json, csv, xml
    Dataset updated
    May 15, 2018
    Dataset provided by
    Department of Energy and Environmental Protection
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    PLEASE NOTE: Use ALL CAPS when searching using the "Filter" function on text such as: LITCHFIELD. But not needed for the upper right corner "Find in this Dataset" search where for example "Litchfield" can be used.
    We know there are errors in the data although we strive to minimize them. Examples include:
    • Manifests completed incorrectly by the generator or the transporter - data was entered based on the incorrect information. We can only enter the information we receive.
    • Data entry errors – we now have QA/QC procedures in place to prevent or catch and fix a lot of these.
    • Historically there are multiple records of the same generator. Each variation in spelling in name or address generated a separate handler record. We have worked to minimize these but many remain. The good news is that as long as they all have the same EPA ID they will all show up in your search results.
    • Handlers provide erroneous data to obtain an EPA ID - data entry was based on erroneous information. Examples include incorrect or bogus addresses and names. There are also a lot of MISSPELLED NAMES AND ADDRESSES!
    • Missing manifests – Not every required manifest gets submitted to the DEP. Also, of the more than 100,000 paper manifests we receive each year, some were incorrectly handled and never entered.
    • Missing data – we know that the records for approximately 25 boxes of manifests, mostly prior to 1985, were lost from the database in the 1980’s.
    • Translation errors – the data has been migrated to newer data platforms numerous times, and each time there have been errors and data losses.
    • Wastes incorrectly entered – mostly due to complex names that were difficult to spell, or typos in quantities or units of measure.

  20. Enhancing Proof Assistant Error Messages with Hints: A User Study

    • data.4tu.nl
    zip
    Updated Jun 18, 2025
    Cite
    Maria Khakimova; Jesper Cockx; Sára Juhošová; Jaro Reinders (2025). Enhancing Proof Assistant Error Messages with Hints: A User Study [Dataset]. http://doi.org/10.4121/79e7c4eb-81dc-492a-9ac4-69f33166de8e.v1
    Available download formats: zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Maria Khakimova; Jesper Cockx; Sára Juhošová; Jaro Reinders
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repository contains the user study data accompanying the master thesis by Maria Khakimova titled "Enhancing Proof Assistant Error Messages with Hints: A User Study". The goal of the research was to investigate the impacts of hint-based error message enhancements in Agda on novice programmers. To do this, we enhanced three error messages with hints, and conducted a user study.


    In the user study, we asked participants to resolve errors in pre-written Agda code, and rate the helpfulness of the error message. We collected the following data:

    • code compilation status (success/fail),
    • compilation timestamps, and
    • responses to the "Did you find the error message helpful?" question (on a Likert scale).


    This repository contains the programming questions created for the user study, with the accompanying error messages (both original and enhanced) in programming_exercises.zip. We also provide the (anonymised) collected data in JSON format in response-data.json.
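
    As a starting point, a minimal sketch (Python) of loading response-data.json and tallying the Likert responses. The field name "helpfulness" and the assumption that the file holds a list of response records are hypothetical; the actual JSON schema is documented in the repository README.

    ```python
    # Minimal sketch of tallying the Likert ratings in response-data.json.
    # The field name "helpfulness" is hypothetical; see the dataset README for the actual schema.
    import json
    from collections import Counter

    with open("response-data.json", encoding="utf-8") as f:
        responses = json.load(f)          # assumed: a list of per-question response records

    ratings = Counter(r.get("helpfulness") for r in responses)
    for rating, count in sorted(ratings.items(), key=str):
        print(rating, count)
    ```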


    For more details, please read the provided README.
