17 datasets found
  1. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016

    • search.datacite.org
    • doi.org
    • +1 more
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    DataCite (https://www.datacite.org/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    • Removes support for SPSS and Excel data.
    • Changes the crimes that are stored in each file. There are more files now, with fewer crimes per file. The files and their included crimes have been updated below.
    • Adds in agencies that report 0 months of the year.
    • Adds a column that indicates the number of months reported. This is generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    • Removes data on runaways.
    Version 4 release notes:
    • Changes the column names "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium.
    • Changes the column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    • Adds data for 2016.
    • Orders rows by year (descending) and ORI.
    Version 2 release notes:
    • Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2016 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the asciiSetupReader package. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions, please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value for when an agency reports zero arrests other than "None/not reported"; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
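
    A minimal pandas sketch of this recoding; the file name and the identifier columns below are assumptions for illustration, not part of the release:

    import pandas as pd

    # Hypothetical file and identifier-column names.
    df = pd.read_csv("ucr_arrests_yearly.csv")
    id_cols = ["ori", "year", "agency_name"]  # assumed identifier columns
    arrest_cols = [c for c in df.columns if c not in id_cols]

    # "None/not reported" becomes zero (the possibly-incorrect assumption noted above)...
    df[arrest_cols] = (
        df[arrest_cols]
        .replace("None/not reported", 0)
        .apply(pd.to_numeric, errors="coerce")
    )
    # ...and the implausible sentinel counts become missing.
    bad_values = [10000, 20000, 30000, 40000, 50000, 60000,
                  70000, 80000, 90000, 100000, 99999, 99998]
    df[arrest_cols] = df[arrest_cols].mask(df[arrest_cols].isin(bad_values))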

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units, such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). It also provides the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC adds FIPS codes (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros; if you open the file in Excel, it will automatically delete those leading zeros (see the sketch below for one way to avoid this).
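
    A minimal pandas sketch that keeps the leading zeros when loading the data programmatically (the file and FIPS column names are assumptions):

    import pandas as pd

    # Reading FIPS codes as strings preserves their leading zeros.
    df = pd.read_csv(
        "ucr_arrests_yearly.csv",  # hypothetical file name
        dtype={"fips_state_code": str, "fips_county_code": str, "fips_place_code": str},
    )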

    I created 9 arrest categories myself. The categories are:

    • Total Male Juvenile
    • Total Female Juvenile
    • Total Male Adult
    • Total Female Adult
    • Total Male
    • Total Female
    • Total Juvenile
    • Total Adult
    • Total Arrests

    All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file includes only the above 9 columns as arrest data (all other columns are agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.

    As the arrest data is very granular and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight contain different crimes, and the ninth is the "simple" file. Each file contains the data for all years. The eight categories each hold crimes belonging to a major crime category and do not overlap in crimes, other than the index offenses. Please note that the crime names provided below are not the same as the column names in the data: because Stata limits column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes
    • Murder
    • Rape
    • Robbery
    • Aggravated Assault
    • Burglary
    • Theft
    • Motor Vehicle Theft
    • Arson

    Alcohol Crimes
    • DUI
    • Drunkenness
    • Liquor

    Drug Crimes
    • Total Drug
    • Total Drug Sales
    • Total Drug Possession
    • Cannabis Possession
    • Cannabis Sales
    • Heroin or Cocaine Possession
    • Heroin or Cocaine Sales
    • Other Drug Possession
    • Other Drug Sales
    • Synthetic Narcotic Possession
    • Synthetic Narcotic Sales

    Grey Collar and Property Crimes
    • Forgery
    • Fraud
    • Stolen Property
    • Financial Crimes
    • Embezzlement
    • Total Gambling
    • Other Gambling
    • Bookmaking
    • Numbers Lottery

    Sex or Family Crimes
    • Offenses Against the Family and Children
    • Other Sex Offenses
    • Prostitution
    • Rape

    Violent Crimes
    • Aggravated Assault
    • Murder
    • Negligent Manslaughter
    • Robbery
    • Weapon Offenses

    Other Crimes
    • Curfew
    • Disorderly Conduct
    • Other Non-traffic
    • Suspicion
    • Vandalism
    • Vagrancy

    Simple
    This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

  2. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arson 1979-2024

    • openicpsr.org
    Updated May 19, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arson 1979-2024 [Dataset]. http://doi.org/10.3886/E103540V13
    Explore at:
    Dataset updated
    May 19, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1979 - 2024
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com

    Version 13 release notes:
    • Adds 2023-2024 data.
    Version 12 release notes:
    • Adds 2022 data.
    Version 11 release notes:
    • Adds 2021 data.
    Version 10 release notes:
    • Adds 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last arson data they release.
    • Changes the .rda file to .rds.
    Version 9 release notes:
    • Changes the release notes description; does not change the data.
    Version 8 release notes:
    • Adds 2019 data.
    • Note that the number of months missing variable changes sharply starting in 2018. This is probably due to changes in UCR reporting of the column_2_type variable, which is used to generate the months missing count (the code I used does not change), so pre-2018 and 2018+ years may not be comparable for this variable.
    Version 7 release notes:
    • Adds a last_month_reported column, which says which month was reported last. This is actually how the FBI defines number_of_months_reported, so it is a more accurate representation of that.
    • Removes the number_of_months_reported variable, as the name is misleading. You should use the last_month_reported or number_of_months_missing (see below) variable instead.
    • Adds a number_of_months_missing column in the annual data, which is the number of times the agency reports "missing" data (i.e. did not report that month) in the card_2_type variable or reports NA in that variable. Please note that this variable is not perfect: sometimes an agency does not report data but this variable does not say it is missing, so it will not be perfectly accurate.
    Version 6 release notes:
    • Adds 2018 data.
    Version 5 release notes:
    • Adds data in the following formats: SPSS and Excel.
    • Changes the project name to avoid confusing this data with the versions released by NACJD.
    Version 4 release notes:
    • Adds 1979-2000, 2006, and 2017 data.
    • Adds agencies that reported 0 months.
    • Adds monthly data.
    • All data is now from the FBI, not NACJD.
    • Changes some column names so all columns are <=32 characters, to be usable in Stata.
    Version 3 release notes:
    • Adds data for 2016.
    • Orders rows by year (descending) and ORI.
    • Removes data from Chattahoochee Hills (ORI = "GA06059") from the 2016 data. In 2016, that agency reported about 28 times as many vehicle arsons as its population (total mobile arsons = 77,762; population = 2,754).
    Version 2 release notes:
    • Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    This Arson data set is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about arsons reported in the United States: the number of arsons reported, the number found to have actually occurred, the number found not to have occurred ("unfounded"), the number cleared by the arrest of at least one arsonist, the number cleared by arrest where all offenders were under the age of 18, and the estimated cost of the arson. This is done for a number of different arson location categories, such as community building, residence, vehicle, and industrial/manufacturing structure. The yearly data sets here combine data from the years 1979-2018 into a single file for each group of crimes. Each monthly file is only a single year, as my laptop can't handle combining all the years together. These files are quite large and may take some time to load. I also added state, county, and place FIPS codes from the LEAIC crosswalk.

    A small number of agencies had some months with clearly incorrect data. I changed the incorrect columns to NA and left the other columns unchanged for that agency. The following are data problems that I fixed; there are still likely issues remaining in the data, so make sure to check yourself before running analyses.
    • Oneida, New York (ORI = NY03200) had multiple years that reported single arsons costing over $700 million. I deleted this agency from all years of data.
    • In January 1989, Union, North Carolina (ORI = NC09000) reported 30,000 arsons in uninhabited single-occupancy buildings and none in any other month.
    • In December 1991, Gadsden, Florida (ORI = FL02000) reported that a single arson at a community/public building caused $99,999,999 in damages (the maximum possible value).
    • In April 2017, St. Paul, Minnesota (ORI = MN06209) reported 73,400 arsons in uninhabited storage buildings and 10,000 arsons in uninhabited community/public buildings, and one or fewer in every other month.

  3. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    bin, xz
    Available download formats
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    Because it contains long off periods of zeros, the CSV file compresses very well.

    To extract it, use: xz -d DARCK.csv.xz
    The compression reduces the file size by 97% (from 4 GB to 90.9 MB).


    To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])
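
    Since the uncompressed file is about 4 GB, a more memory-frugal load may help; a minimal sketch, assuming float32 precision is sufficient for values that are rounded to one decimal anyway:

    import pandas as pd

    # Downcast every power column from float64 to float32,
    # roughly halving the DataFrame's memory use.
    df = pd.read_csv("DARCK.csv", parse_dates=["time"])
    power_cols = df.columns.drop("time")
    df[power_cols] = df[power_cols].astype("float32")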

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in Docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents 1 second, from start of measuring in March until the end in September.

    Column Descriptions

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
    | main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
    | [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |

    Aggregate Columns

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, and kitchencharger. |
    | aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
    | aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp, and watsonledmap. |

    Analysis Columns

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of the individual measurements and the mains reading. A 30 W offset is applied to the sum, since the measurement devices themselves draw power which is otherwise unaccounted for. |

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10 W or above 10,000 W were removed (only 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (a large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align these readings to whole seconds, the series was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% of seconds there is only one value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation, as sketched after this list.
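
    A minimal pandas sketch of step 3, assuming the raw meter log has "time" and "power" columns (an assumption; the raw schema is not given above):

    import pandas as pd

    # Average all readings that fall within each second, then fill the
    # resulting gaps by linear interpolation.
    meter = pd.read_csv("meter.csv", parse_dates=["time"]).set_index("time")
    main = meter["power"].resample("1s").mean().interpolate(method="linear")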

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in drawn power; if no change is observed, or the observed change is too small (less than a few watts), a reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill(). This method was chosen to, first, capture the last known state of the device within each second, handling rapid on/off events, and, second, to forward-fill that state across periods with no new data, modeling that the device's consumption remained constant until a new reading was sent. A minimal sketch follows this list.
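
    A minimal sketch of this step, assuming shellies.csv has "time", "device", and "power" columns (the raw schema is not given above):

    import pandas as pd

    # Per device: keep the last reading within each second, then forward-fill
    # so the last known state persists until a new reading arrives.
    raw = pd.read_csv("shellies.csv", parse_dates=["time"])
    shellies = pd.DataFrame({
        device: grp.set_index("time")["power"].resample("1s").last().ffill()
        for device, grp in raw.groupby("device")
    })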

    6.3. Merging and Finalization

    1. Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption (see the sketch below).
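
    Continuing the two sketches above, the merge and final fill might look like:

    # "shellies" and "main" come from the previous two sketches.
    df = shellies.join(main.rename("main"), how="outer")
    df = df.fillna(0.0)  # assume zero consumption before a device was installed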

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
    2. May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names, with explanations where needed. As Watson moved at the beginning of June, some metering plugs changed their assigned appliance.

  4. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arson 1979-2021

    • openicpsr.org
    Updated May 19, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arson 1979-2021 [Dataset]. http://doi.org/10.3886/E103540V11
    Explore at:
    Dataset updated
    May 19, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1979 - 2020
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com

    Version 11 release notes:
    • Adds 2021 data.
    Version 10 release notes:
    • Adds 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last arson data they release.
    • Changes the .rda file to .rds.
    Version 9 release notes:
    • Changes the release notes description; does not change the data.
    Version 8 release notes:
    • Adds 2019 data.
    • Note that the number of months missing variable changes sharply starting in 2018. This is probably due to changes in UCR reporting of the column_2_type variable, which is used to generate the months missing count (the code I used does not change), so pre-2018 and 2018+ years may not be comparable for this variable.
    Version 7 release notes:
    • Adds a last_month_reported column, which says which month was reported last. This is actually how the FBI defines number_of_months_reported, so it is a more accurate representation of that.
    • Removes the number_of_months_reported variable, as the name is misleading. You should use the last_month_reported or number_of_months_missing (see below) variable instead.
    • Adds a number_of_months_missing column in the annual data, which is the number of times the agency reports "missing" data (i.e. did not report that month) in the card_2_type variable or reports NA in that variable. Please note that this variable is not perfect: sometimes an agency does not report data but this variable does not say it is missing, so it will not be perfectly accurate.
    Version 6 release notes:
    • Adds 2018 data.
    Version 5 release notes:
    • Adds data in the following formats: SPSS and Excel.
    • Changes the project name to avoid confusing this data with the versions released by NACJD.
    Version 4 release notes:
    • Adds 1979-2000, 2006, and 2017 data.
    • Adds agencies that reported 0 months.
    • Adds monthly data.
    • All data is now from the FBI, not NACJD.
    • Changes some column names so all columns are <=32 characters, to be usable in Stata.
    Version 3 release notes:
    • Adds data for 2016.
    • Orders rows by year (descending) and ORI.
    • Removes data from Chattahoochee Hills (ORI = "GA06059") from the 2016 data. In 2016, that agency reported about 28 times as many vehicle arsons as its population (total mobile arsons = 77,762; population = 2,754).
    Version 2 release notes:
    • Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    This Arson data set is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about arsons reported in the United States: the number of arsons reported, the number found to have actually occurred, the number found not to have occurred ("unfounded"), the number cleared by the arrest of at least one arsonist, the number cleared by arrest where all offenders were under the age of 18, and the estimated cost of the arson. This is done for a number of different arson location categories, such as community building, residence, vehicle, and industrial/manufacturing structure. The yearly data sets here combine data from the years 1979-2018 into a single file for each group of crimes. Each monthly file is only a single year, as my laptop can't handle combining all the years together. These files are quite large and may take some time to load. I also added state, county, and place FIPS codes from the LEAIC crosswalk.

    A small number of agencies had some months with clearly incorrect data. I changed the incorrect columns to NA and left the other columns unchanged for that agency. The following are data problems that I fixed; there are still likely issues remaining in the data, so make sure to check yourself before running analyses.
    • Oneida, New York (ORI = NY03200) had multiple years that reported single arsons costing over $700 million. I deleted this agency from all years of data.
    • In January 1989, Union, North Carolina (ORI = NC09000) reported 30,000 arsons in uninhabited single-occupancy buildings and none in any other month.
    • In December 1991, Gadsden, Florida (ORI = FL02000) reported that a single arson at a community/public building caused $99,999,999 in damages (the maximum possible value).
    • In April 2017, St. Paul, Minnesota (ORI = MN06209) reported 73,400 arsons in uninhabited storage buildings and 10,000 arsons in uninhabited community/public buildings, and one or fewer in every other month.

    When an arson is determined to be unfounded the estimated damage from that arson

  5. Data from: GoiEner smart meters data

    • research.science.eus
    • observatorio-cientifico.ua.es
    • +1 more
    Updated 2022
    Cite
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris (2022). GoiEner smart meters data [Dataset]. https://research.science.eus/documentos/668fc48cb9e7c03b01be0b72
    Explore at:
    Dataset updated
    2022
    Authors
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris
    Description

    Name: GoiEner smart meters data

    Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files plus a metadata file:
    • raw.tzst: raw time series of all GoiEner clients (any date, any length, may have missing samples).
    • imp-pre.tzst: processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020.
    • imp-in.tzst: processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021.
    • imp-post.tzst: processed time series (imputation of missing samples), longer than one year, collected after May 30, 2021.
    • metadata.csv: relevant information for each time series.

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.

    Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.

    Collection Date: From November 2, 2014 to June 8, 2022.
    Publication Date: December 1, 2022.
    DOI: 10.5281/zenodo.7362094
    Other repositories: None.
    Author: GoiEner, University of Deusto.

    Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption.

    Description: The meaning of each column is described next for each file.
    • raw.tzst (no column names provided): timestamp; electricity consumption in kWh.
    • imp-pre.tzst, imp-in.tzst, imp-post.tzst: “timestamp”: timestamp; “kWh”: electricity consumption in kWh; “imputed”: binary value indicating whether the row has been obtained by imputation.
    • metadata.csv: “user”: 64-character hash identifying a user; “start_date”: initial timestamp of the time series; “end_date”: final timestamp of the time series; “length_days”: number of days elapsed between the initial and final timestamps; “length_years”: number of years elapsed between the initial and final timestamps; “potential_samples”: number of samples there should be between the initial and final timestamps if there were no missing values; “actual_samples”: number of actual samples in the time series; “missing_samples_abs”: potential samples minus actual samples; “missing_samples_pct”: potential samples minus actual samples as a percentage; “contract_start_date”: contract start date; “contract_end_date”: contract end date; “contracted_tariff”: type of tariff contracted (2.X: households and SMEs; 3.X: SMEs with high consumption; 6.X: industries, large commercial areas, and farms); “self_consumption_type”: the type of self-consumption to which the users are subscribed; “p1”, “p2”, “p3”, “p4”, “p5”, “p6”: contracted power (in kW) for each of the six time slots; “province”: province where the user is located; “municipality”: municipality where the user is located (municipalities below 50,000 inhabitants have been removed); “zip_code”: post code (post codes of municipalities below 50,000 inhabitants have been removed); “cnae”: CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification.

    5 star: ⭐⭐⭐

    Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm with weekly seasons); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation).

    Reuse: This dataset is related to the datasets "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed, and "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapting to a change of tariff.

    Update policy: There might be a single update in mid-2023.

    Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS with random 64-character hashes.

    Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15,562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17,519 CSV files.

    Other: None.
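
    The .tzst archives are zstandard-compressed tar files. A minimal Python sketch for streaming one of them, assuming the third-party zstandard package is installed and that the raw files carry no header row (per the description above):

    import tarfile

    import pandas as pd
    import zstandard as zstd

    # Stream the archive so the 15.1 GB folder never has to fit in memory;
    # each member is one client's CSV. Keep only what you need.
    series = {}
    with open("raw.tzst", "rb") as fh:
        with zstd.ZstdDecompressor().stream_reader(fh) as reader:
            with tarfile.open(fileobj=reader, mode="r|") as tar:
                for member in tar:
                    if not member.isfile():
                        continue
                    df = pd.read_csv(
                        tar.extractfile(member),
                        header=None,
                        names=["timestamp", "kWh"],
                        parse_dates=["timestamp"],
                    )
                    series[member.name] = df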

  6. Car trips data log

    • kaggle.com
    zip
    Updated Nov 17, 2017
    Cite
    Vitor R. F. (2017). Car trips data log [Dataset]. https://www.kaggle.com/vitorrf/cartripsdatamining
    Explore at:
    zip (552822421 bytes)
    Available download formats
    Dataset updated
    Nov 17, 2017
    Authors
    Vitor R. F.
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains data acquired while a driver drove a vehicle through various conditions. The collected data were used in an attempt to predict driver behaviour in order to improve gearbox control.

    Content


    Acknowledgements

    About RAW DATA: contents of CSV files:

    File names: yyyy-mm-dd_hh-mm-ss.csv (timestamp of start of data collection)

    • Column 1: Time vector in seconds
    • Column 2: Engine RPM from OBD sensor
    • Column 3: Car’s speed in km/h
    • Column 4: Calculated engine load (in % of max power)
    • Columns 5-7: Accelerometer data (XYZ) in G
    • Columns 8-10: Gyroscope data (XYZ) in rad/s
    • Columns 11-13: Magnetometer data (XYZ)

    File names: yyyy-mm-dd_hh-mm-ss_ext.csv

    Contains data entered by the user:

    • Column 1: Timestamp of parameter change
    • Column 2: Passenger count (0 - 5)
    • Column 3: Car’s load (0 - 10)
    • Column 4: Air conditioning status (0 - 4)
    • Column 5: Window opening (0 - 10)
    • Column 6: Radio volume (0 - 10)
    • Column 7: Rain intensity (0 - 10)
    • Column 8: Visibility (0 - 10)
    • Column 9: Driver’s wellbeing (0 - 10)
    • Column 10: Driver’s rush (0 - 10)

    About PROCESSED DATA: contents of CSV files:

    • Column 1: Time (in seconds)
    • Column 2: Vehicle’s speed (in m/s)
    • Column 3: Shift number (0 = intermediate position)
    • Column 4: Engine Load (% of max power)
    • Column 5: Total Acceleration (m/s^2)
    • Column 6: Engine RPM
    • Column 7: Pitch
    • Column 8: Lateral Acceleration (m/s^2)
    • Column 9: Passenger count (0 - 5)
    • Column 10: Car’s load (0 - 10)
    • Column 11: Air conditioning status (0 - 4)
    • Column 12: Window opening (0 - 10)
    • Column 13: Radio volume (0 - 10)
    • Column 14: Rain intensity (0 - 10)
    • Column 15: Visibility (0 - 10)
    • Column 16: Driver’s wellbeing (0 - 10)
    • Column 17: Driver’s rush (0 - 10)
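
    A minimal pandas sketch for loading one processed-data file, assuming the CSVs carry no header row (an assumption; the short column names below are my own shorthand for the list above):

    import pandas as pd

    # Column order follows the processed-data list above.
    cols = [
        "time_s", "speed_ms", "shift", "engine_load_pct", "total_accel_ms2",
        "rpm", "pitch", "lat_accel_ms2", "passengers", "car_load", "ac_status",
        "window", "radio_volume", "rain", "visibility", "wellbeing", "rush",
    ]
    trip = pd.read_csv("processed_trip.csv", header=None, names=cols)  # hypothetical file name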

    Inspiration

    How efficiently can an automated gearbox be controlled with predictions based on these variables?

  7. Citi Bike Stations

    • kaggle.com
    zip
    Updated Dec 8, 2021
    Cite
    Ethan Rosenthal (2021). Citi Bike Stations [Dataset]. https://www.kaggle.com/rosenthal/citi-bike-stations
    Explore at:
    zip (4139139012 bytes)
    Available download formats
    Dataset updated
    Dec 8, 2021
    Authors
    Ethan Rosenthal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The New York City bikeshare, Citi Bike, has a real time, public API. This API conforms to the General Bikeshare Feed Specification. As such, this API contains information about the number of bikes and docks available at every station in NYC.

    Since 2016, I have been pinging the public API every 2 minutes and storing the results. This dataset contains all of these results, from 8/15/2016 - 12/8/2021. The data unfortunately comes in the form of a bunch of CSVs. I recognize that this is not the best format to read large datasets like this, but a CSV is still a pretty universal format! My suggestion would be to convert these CSVs to parquet or something similar if you plan to do lots of analysis on lots of files.

    Originally, I set up an EC2 instance and pinged a legacy API (code). In 2019, I switched to pinging this API via a Lambda function (code).

    As part of this 2019 switch, I also started pinging the station information API once per week in order to collect information about each station, such as the name, latitude and longitude. While this dataset contains columns for all of the station information, these columns are missing data between 2016 and 8/2019. It would probably be reasonable to backfill that data with the earliest info available for each station, although be warned that this is not guaranteed to be accurate.

    Details

    In order to reduce the individual file size, the full dataset has been bucketed by station_id into 50 separate files. All historical data for a given station_id are in the same file, and the stations are randomly distributed across the 50 files.

    As previously mentioned, station information is missing for all data earlier than 8/2019. I have included a column, missing_station_information, to indicate when this information is missing. You may wonder why I don't just create a separate station information file which can be joined to the file containing the time series. The reason is that the station information can technically change over time. When station information is provided in a given row, that information is accurate as of some point within the prior 7 days. This is because I pinged the station information weekly and then had to join it to the time series.

    The CSV files are the result of a CREATE TABLE AS AWS Athena query using the TEXTFILE format. Consequently, null values are demarcated as \N. The two timestamp columns, station_status_last_reported and station_information_last_updated, are in units of POSIX/UNIX time (i.e. seconds since 1970-01-01 00:00:00 UTC). The following code may be helpful to get you started loading the data as a pandas DataFrame.

    import pandas as pd


    def read_csv(filename: str) -> pd.DataFrame:
      """
      Read DataFrame from a CSV file ``filename`` and convert to a
      preferred schema.
      """
      df = pd.read_csv(
        filename,
        sep=",",
        na_values="\\N",
        dtype={
          "station_id": str,
          # Use Pandas Int16 dtype to allow for nullable integers
          "num_bikes_available": "Int16",
          "num_ebikes_available": "Int16",
          "num_bikes_disabled": "Int16",
          "num_docks_available": "Int16",
          "num_docks_disabled": "Int16",
          "is_installed": "Int16",
          "is_renting": "Int16",
          "is_returning": "Int16",
          "station_status_last_reported": "Int64",
          "station_name": str,
          "lat": float,
          "lon": float,
          "region_id": str,
          "capacity": "Int16",
          # Use pandas boolean dtype to allow for nullable booleans
          "has_kiosk": "boolean",
          "station_information_last_updated": "Int64",
          "missing_station_information": "boolean"
        },
      )
      # Read in timestamps as UNIX/POSIX epochs but then convert to the local
      # bike share timezone.
      df["station_status_last_reported"] = pd.to_datetime(
        df["station_status_last_reported"], unit="s", origin="unix", utc=True
      ).dt.tz_convert("US/Eastern")
    
      df["station_information_last_updated"] = pd.to_datetime(
        df["station_information_last_updated"], unit="s", origin="unix", utc=True
      ).dt.tz_convert("US/Eastern")
      return df
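
    For example (hypothetical file name; the dataset ships as 50 bucketed CSVs):

    df = read_csv("bucket_00.csv")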
    

    The column names almost come directly from the station_status and station_information APIs. See the [GBFS schema](https://github.com/MobilityData/gbfs...

  8. Data from: Data and code for the publication entitled: tree growth and...

    • dataverse.cirad.fr
    Updated Aug 17, 2022
    Cite
    CIRAD Dataverse (2022). Data and code for the publication entitled: tree growth and mortality of 42 timber species in Central Africa [Dataset]. http://doi.org/10.18167/DVN1/EBN15Y
    Explore at:
    html (2073677), application/x-r-data (1684963), text/x-r-source (13462), text/x-r-markdown (57769), csv (928), application/x-rlang-transport (2362007), xlsx (61277)
    Available download formats
    Dataset updated
    Aug 17, 2022
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Central Africa
    Dataset funded by
    This work was supported by the “Fonds Français pour l’Environnement Mondial” (DynAfFor project, convention CZZ1636.01D and CZZ1636.02D, P3FAC project, convention CZZ 2101.01 R).
    Description

    Introduction

    This archive contains all the necessary data and R scripts to reproduce the results of the manuscript entitled "Tree growth and mortality of 42 timber species in Central Africa" submitted to the Forest Ecology and Management journal. It includes cleansed data (text files and Rdata files), computed data (estimates of tree growth and mortality rates, Rdata files), R scripts to reproduce the computation of the estimates as well as the analyses and figures presented in the main paper, and an Excel file containing all the supplementary material tables of the manuscript.

    Cleansed data

    To produce the cleansed data, raw data was collected for each site. The different datasets were standardized so that all of them could be stored in a single database. Next, consecutive diameter measurements were analyzed and some outliers were discarded (see the explanations in the main manuscript). The cleansed data can be loaded using either the text-delimited csv files or the Rdata file. It contains the following five tables.

    Table cleansed_data_species.csv
    This table contains information about each study species. Each line corresponds to one species. It contains the following columns:
    • code: species identifying name
    • timber_name: species name as used by the ATIBT
    • species_name_sci: current scientific species name (genus + species)
    • species_name: vernacular species name
    • dme: reference value of the minimum cutting diameter as defined by the Cameroonian Government
    • incr: reference value of diameter increment (cm/year) as defined by the Cameroonian Government
    • cjb_id: species id in the CJB database
    • see_name: species CJB id of synonym names
    • species_name_sci_full: full current scientific name (genus + species + authority)

    Table cleansed_data_observation_codes.csv
    This table contains the description of the codes used in the field to note any particularities of the monitored trees. One line corresponds to one code. There are three columns:
    • code: observation code
    • label_fr: French explanation of the code (as used in the field)
    • label_en: English translation of the explanation of the code

    Table cleansed_data_mortality_codes.csv
    This table contains the description of the codes used to characterize the likely cause of recorded tree death. There are three columns:
    • code: mortality code
    • label_fr: French explanation of the code (as used in the field)
    • label_en: English translation of the explanation of the code

    Table cleansed_data_records.csv
    This table contains the information collected for each tree. Each line corresponds to one record for one tree; there are several lines per tree, as trees were measured several times. It contains the following columns:
    • site: site name
    • id_site: site identifying number
    • id_plot: plot identifying number
    • treatment: treatment (control, exploited or mixed)
    • exploitation_year: year of the exploitation (4 digits)
    • species: species vernacular name (corresponding to the species_name column of the species table)
    • id_tree: tree identifying number
    • number: tree number (the number that was painted on the tree)
    • id: record identifying number
    • date: record date (yyyy-mm-dd)
    • census_year: year of the census
    • diameter: tree diameter measured at hom (cm)
    • diameter450: tree diameter measured at 450 cm in height (cm)
    • hom: height of measurement of the diameter (cm)
    • code_observation: observation codes; multiple codes were sometimes used and are separated by a dash (corresponding to the code column of the observation_codes table)
    • code_mortality: mortality codes (corresponding to the code column of the mortality_codes table)
    • comment: any additional comment

    Table cleansed_data_increments.csv
    • id: id of the initial measurement
    • id_tree: tree id number
    • id_plot: plot id number
    • treatment: treatment (control, exploited)
    • species: species vernacular name (corresponding to species_name of the species table)
    • number: tree number (the number that was written on the tree)
    • hom: height of measurement (cm)
    • id_hom: id of the HOM (sometimes the HOM had to be changed, e.g. due to buttresses or wounds)
    • initial_date: date of the first census
    • initial_diameter: the diameter measured at the first census (cm)
    • diameter_increment: the annual diameter increment computed between the two considered censuses (cm/year)
    • increment_period: the number of years separating the two censuses (years)
    • diameter_observation: observation codes (corresponding to the code column of the observation_codes table) noted during the first and second census, separated by a "/"
    • diameter_comment: additional comments written during the two measurements, separated by a "/"
    • id_species: species identifying number
    • id_site: site identifying number
    • site: name of the site
    • exploitation_year: year of the exploitation (if any)

    File cleansed_data.Rdata
    This Rdata file contains the five tables (species, mortality_codes, observation_codes, records and increments) of the cleansed data. It can be used to rapidly load them in R.

    Computed data

    From the cleansed data, we computed tree growth and mortality rates, as explained in the main manuscript, using an R script (3-computation.R). This script produces the "computed data", which contains six tables provided as three additional csv files and one Rdata file.

    Table computed_data_records.csv
    This table is the same as the records table but with one additional column:
    • exploitation_date: the assumed date of the exploitation, if any (yyyy-mm-dd)

    Table computed_data_growth_rates.csv
    This table contains one line per combination of tree and treatment. It contains the estimates of diameter increment computed over all available records. This table contains the following columns:
    • site: site name
    • id_site: site identifying number
    • treatment: treatment (control or exploited)
    • species: species vernacular name
    • id_plot: plot id number
    • id_tree: tree id number
    • initial_diameter: tree diameter at the beginning of the census period (cm)
    • increment_period: length of the census period (years)
    • initial_date: date of the first census (yyyy-mm-dd)
    • diameter_observation: observation codes, if any
    • diameter_comment: comment, if any
    • exploitation_year: year of the exploitation (4 digits)
    • exploitation_date: assumed date of the last exploitation (if treatment = logged or mixed)
    • mid_point: mid-point of the census period (yyyy-mm-dd)
    • years_after_expl: length of time between the exploitation date and the first measurement
    • n_increment: number of consecutive increments
    • n_hom: number of changes of hom during the census period
    • diameter_increment: estimate of the diameter increment (cm/year)

    Table computed_data_mortality_rates.csv
    This table contains estimates of mortality rates for each species and site. This table contains the following columns:
    • id_site: site id number
    • treatment: treatment (control or exploited)
    • time_min: minimum of the length of the census periods
    • time_max: maximum of the length of the census periods
    • time_sd: standard deviation of the length of the census periods (deleted)
    • exploitation_year: exploitation year (if treatment = exploited)
    • years_after_expl_mid: number of years between the assumed exploitation and the mid-period census
    • years_after_expl_start: number of years between the assumed exploitation and the first census
    • site: site name
    • species: species vernacular name
    • N0: number of monitored trees
    • N_surviving: number of surviving trees
    • meantime: mean monitoring period length
    • rate: estimate of the mortality rate
    • lowerCI: lower bound of the confidence interval of the mortality rate
    • upperCI: upper bound of the confidence interval of the mortality rate

    File computed_data.Rdata
    This Rdata file contains the six tables (species, records, growth_rates, mortality_rates, mortality_codes, observation_codes) of the computed data. It can be used to load them in R.

    Analyses

    The analyses presented in the main manuscript were produced with an Rmd script (4-analyses.Rmd). This script generates an HTML report (4-analyses.html), as well as the figures shown in the manuscript and an Excel file with all the supplementary tables (one sheet per supplementary table).
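
    A minimal pandas sketch, assuming the csv files have been extracted from the archive under the names listed above:

    import pandas as pd

    # Mean annual diameter increment (cm/year) per species, from the increments table.
    inc = pd.read_csv("cleansed_data_increments.csv")
    print(inc.groupby("species")["diameter_increment"].mean())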

  9. Internet Speeds Across Europe

    • kaggle.com
    zip
    Updated Nov 23, 2022
    Cite
    The Devastator (2022). Internet Speeds Across Europe [Dataset]. https://www.kaggle.com/thedevastator/internet-speeds-across-europe-quarter-4-2020
    Explore at:
    zip (8277 bytes)
    Available download formats
    Dataset updated
    Nov 23, 2022
    Authors
    The Devastator
    Description

    Internet Speeds Across Europe

    The Battle for download supremacy

    By Andy Kriebel [source]

    About this dataset

    This dataset contains average internet speeds across Europe for the fourth quarter of 2020. The countries are ranked according to their average download speed, with the fastest country at the top. The data is a valuable resource for anyone interested in comparing internet speeds across different countries in Europe.

    How to use the dataset

    Here are some tips on what you can do with the Internet Speeds Across Europe dataset:

    • The dataset contains information on average internet speeds across Europe for the fourth quarter of 2020. This can be helpful in understanding what regions might have better or worse connectivity.
    • The data is provided at the country level, so you can compare speeds between countries.
    • You can also use the data to compare average speeds within a country across quarters. This could be helpful in understanding if there has been any change over time.

    Research Ideas

    • This dataset can be used to analyze internet speeds in different countries across Europe.
    • This dataset can be used to compare internet speeds in different quarters of the year.
    • This dataset can be used to determine which countries have the fastest internet speeds.

    License

    License: Dataset copyright by authors.
    You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.
    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    File: Average Internet Speeds Across Europe.csv

    | Column name | Description |
    |:---------------------------|:-------------------------------------------------------------------------|
    | Country | The country the data is from. (String) |
    | Country Name | The name of the country the data is from. (String) |
    | quarter | The quarter the data is from. (String) |
    | average download speed | The average download speed for the country in the given quarter. (Float) |
    | average upload speed | The average upload speed for the country in the given quarter. (Float) |
    | average latency | The average latency for the country in the given quarter. (Float) |
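
    A minimal pandas sketch that reproduces the download-speed ranking, using the file and column names from the table above:

    import pandas as pd

    # Rank countries by average download speed, fastest first.
    df = pd.read_csv("Average Internet Speeds Across Europe.csv")
    ranked = df.sort_values("average download speed", ascending=False)
    print(ranked[["Country Name", "average download speed"]].head(10))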

    Acknowledgements

    If you use this dataset in your research, please credit Andy Kriebel.

  10. Household Energy Consumption

    • kaggle.com
    zip
    Updated Apr 5, 2025
    Cite
    Samharison (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/samxsam/household-energy-consumption
    Explore at:
    zip (748210 bytes)
    Available download formats
    Dataset updated
    Apr 5, 2025
    Authors
    Samharison
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🏡 Household Energy Consumption - April 2025 (90,000 Records)

    📌 Overview

    This dataset presents detailed energy consumption records from various households over the month of April 2025. With 90,000 rows and multiple features such as temperature, household size, air conditioning usage, and peak hour consumption, this dataset is well suited to time-series analysis, machine learning, and sustainability research.

    | Column Name | Data Type Category | Description |
    |---|---|---|
    | Household_ID | Categorical (Nominal) | Unique identifier for each household |
    | Date | Datetime | The date of the energy usage record |
    | Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
    | Household_Size | Numerical (Discrete) | Number of individuals living in the household |
    | Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
    | Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
    | Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
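
    A minimal pandas sketch, using the file name from the Libraries section below, to compare households with and without air conditioning:

    import pandas as pd

    # Mean daily consumption for AC vs. non-AC households.
    df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])
    print(df.groupby("Has_AC")["Energy_Consumption_kWh"].mean())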

    📂 Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values)

    📚 Libraries Used for Working with household_energy_consumption_2025.csv

    🔍 1. Data Manipulation & Analysis

    | Library | Purpose |
    |---|---|
    | pandas | Reading, cleaning, and transforming tabular data |
    | numpy | Numerical operations, working with arrays |

    📊 2. Data Visualization

    | Library | Purpose |
    |---|---|
    | matplotlib | Creating static plots (line, bar, histograms, etc.) |
    | seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
    | plotly | Interactive charts (time series, pie, bar, scatter, etc.) |

    📈 3. Machine Learning / Modeling

    | Library | Purpose |
    |---|---|
    | scikit-learn | Preprocessing, regression, classification, clustering |
    | xgboost / lightgbm | Gradient boosting models for better accuracy |

    🧹 4. Data Preprocessing

    | Library | Purpose |
    |---|---|
    | sklearn.preprocessing | Encoding categorical features, scaling, normalization |
    | datetime / pandas | Date-time conversion and manipulation |

    🧪 5. Model Evaluation

    | Library | Purpose |
    |---|---|
    | sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |

    ✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.

    📈 Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    🔮 Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    💡 Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning.
    • Compare energy usage efficiency across varying household sizes.

    🌡️ Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    🔌 Peak Load Management

    • Build models to predict and manage energy demand during peak hours.
    • Support research on smart grid technologies and dynamic pricing.

    🧠 Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs.
    • Anomaly detection in energy usage for fault detection.

    🛠️ Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage
  11. Data from: Fishing intensity in the Atlantic Ocean (from Global Fishing Watch)

    • research.science.eus
    Updated 2024
    + more versions
    Cite
    Mateo, Maria; Anabitarte Riol, Asier; Granado, Igor; Fernandes, Jose-A. (2024). Fishing intensity in the Atlantic Ocean (from Global Fishing Watch) [Dataset]. https://research.science.eus/documentos/67a9c7c919544708f8c7281a
    Explore at:
    Dataset updated
    2024
    Authors
    Mateo, Maria; Anabitarte Riol, Asier; Granado, Igor; Fernandes, Jose-A.
    Area covered
    Atlantic Ocean
    Description
    1. MISSION ATLANTIC

    The MISSION ATLANTIC project is an EU-funded initiative that focuses on understanding the impacts of climate change and human activities on Atlantic marine ecosystems. The project aims to map and assess the current and future status of Atlantic marine ecosystems, develop tools for sustainable management, and support ecosystem-based governance to ensure the resilience and sustainable use of ocean resources. The project brings together experts from 33 partner organizations across 14 countries in Europe, Africa, and North and South America.

    MISSION ATLANTIC includes ten work packages. The present published dataset is included in WP3, which focuses on mapping the pelagic ecosystems, resources, and pressures in the Atlantic Ocean. This WP aims to collect extensive spatial and temporal data to create 3D maps of the water column, identify key vertical ecosystem domains, and assess the pressures from climate change and human activities. More specifically, the dataset corresponds to the fishing intensity presented in the Deliverable 3.2, which integrates data from various sources to map the distribution and dynamics of present ecosystem pressures over time, providing crucial insights for sustainable management strategies.

    2. Data description

    2.1. Data Source

    Fishing intensity estimates from the Global Fishing Watch (GFW) initiative (Kroodsma et al. 2018), which applies machine learning algorithms to data from Automatic Identification Systems (AIS), Vessel Monitoring Systems (VMS), and vessel registries, have been used for the year 2020. This machine learning approach is able to distinguish between the fishing and routing activity of individual vessels, while using pattern recognition to differentiate seven main fishing gear types at the Atlantic Ocean scale (Taconet et al., 2019). The seven main fishing vessel types considered are: trawlers, purse seiners, drifting longliners, set gillnets, squid jiggers, pots and traps, and other. In this work we have aggregated these into pelagic, seabed, and passive fishing activities to align with our grouping of ecosystem components.

    The GFW data has some limitations:

    AIS is only required for large vessels. The International Maritime Organization requires AIS use for all vessels of 300 gross tonnage and upward, although some jurisdictions mandate its use in smaller vessels. For example, within the European Union it is required for fishing vessels at least 15m in length. This means that in some areas the fishing intensity estimates will not include the activity of small vessels operating near shore.

    AIS can be intentionally turned off, for example, when vessels carry out illegal fishing activities (Kurekin et al. 2019).

    In the GFW dataset, vessels classified as trawlers include both pelagic and bottom trawlers. As all trawlers are included in the bottom fishing category, it is highly likely that the data overestimates the effort on the seafloor and underestimates it in the water column.

    2.2. Data Processing

    1. Data download from the GFW portal.

    2. Using R:

    Combine the daily files and aggregate fishing hours by fishing gear and coordinates:

    library(data.table)

    ## Load data
    fileIdx = list.files(".../fleet-daily-csvs-100-v2-2020/", full.names = T)

    ## Columns to keep
    colsIdx = c("geartype", "hours", "fishing_hours", "x", "y")

    ## Loop over the daily files: snap coordinates to the 0.1-degree grid,
    ## aggregate, and append each result to a single output file
    lapply(fileIdx, function(xx) {
      out = data.table(x = NA_real_, y = NA_real_, geartype = NA_character_)
      tmp = fread(xx)
      tmp[, ":=" (y = floor(cell_ll_lat * 10L) / 10L,
                  x = floor(cell_ll_lon * 10L) / 10L)]
      tmp = tmp[, ..colsIdx]
      h = tmp[, c(.N, lapply(.SD, sum, na.rm = T)), by = .(x, y, geartype)]
      # Merge onto a one-row NA template to keep a consistent column layout
      outh = data.table::merge.data.table(out, h, by = c("x", "y", "geartype"), all = TRUE)
      fwrite(outh, ".../GFW_2020_0.1_degrees_and_gear_all.csv", nThread = 14, append = T)
    })

    Group fishing gears into main fishing groups:

    library(dplyr)
    library(tidyr)

    ## Load data
    fishing <- read.csv(".../GFW_2020_0.1_degrees_and_gear_all.csv",
                        sep = ",", dec = ".", header = T, stringsAsFactors = FALSE)

    ## Inspect the gear types present
    unique(fishing$geartype)

    ## Group fishing gears (fishing, pelagic, bottom, passive)
    fishing$group <- NA
    fishing$group[which(fishing$geartype == "fishing")] <- "fishing"  # Unknown gear

    fishing$group[fishing$geartype %in% c("trollers", "squid_jigger", "pole_and_line",
                                          "purse_seines", "tuna_purse_seines", "seiners",
                                          "other_purse_seines", "other_seines",
                                          "set_longlines", "drifting_longlines")] <- "pelagic"

    fishing$group[fishing$geartype %in% c("trawlers", "dredge_fishing")] <- "bottom"

    fishing$group[fishing$geartype %in% c("set_gillnets", "fixed_gear", "pots_and_traps")] <- "passive"

    ## Total fishing hours (by fishing group and position)
    fish_gr <- fishing %>%
      group_by(x, y, group) %>%
      summarise(gfishing_hours = sum(fishing_hours))

    Pivot the table so that the fishing groups become columns. Each row corresponds to the coordinates of the lower-left corner of a grid cell (0.1 decimal degrees):

    ## Pivot the table (fishing groups in columns)
    fish_gr3 <- fish_gr %>%
      pivot_wider(names_from = "group", values_from = "gfishing_hours", values_fill = 0)

    ## Save data (to import into PostgreSQL)
    write.csv(fish_gr3, ".../fishing.csv", row.names = FALSE)

    Import the table into our PostGIS spatial database using QGIS.

    3. Using PostgreSQL:

    Create grid cell identifiers (gid):

    -- Generate a gid
    ALTER TABLE public.fishing ADD COLUMN gid uuid PRIMARY KEY DEFAULT uuid_generate_v4();

    Estimate the centroid of each grid cell (x and y give the lower-left corner of each 0.1-degree cell, so the centroid is offset by 0.05 degrees):

    -- Create columns
    ALTER TABLE public.fishing ADD COLUMN cen_lat float;
    ALTER TABLE public.fishing ADD COLUMN cen_lon float;

    -- Calculate the grid centroid
    UPDATE public.fishing SET cen_lat = y + 0.05;
    UPDATE public.fishing SET cen_lon = x + 0.05;

    Create the geometry column based on the estimated centroids to provide the spatial component:

    -- (if necessary)
    SELECT AddGeometryColumn('public', 'fishing', 'geom', 4326, 'POINT', 2);
    UPDATE public.fishing SET geom = ST_SetSRID(ST_MakePoint(cen_lon, cen_lat), 4326);
    ALTER TABLE public.fishing RENAME COLUMN geom TO geom_point;

    Expand a bounding box in all directions from the centroid geometry to estimate the grid cell (from point to polygon):

    -- Expand a bounding box in all directions from the centroid geometry
    SELECT AddGeometryColumn('public', 'fishing', 'geom', 4326, 'POLYGON', 2);
    UPDATE public.fishing SET geom = ST_Expand(geom_point, 0.05);

    -- Drop deprecated columns
    ALTER TABLE public.fishing DROP COLUMN geom_point;
    ALTER TABLE public.fishing DROP COLUMN cen_lat;
    ALTER TABLE public.fishing DROP COLUMN cen_lon;

    -- Create a spatial index
    CREATE INDEX ON public.fishing USING gist (geom);

    Estimate the fishing hours per square kilometre by fishing group in each grid cell:

    -- Create columns to estimate fishing hours per km2
    ALTER TABLE public.fishing
      ADD COLUMN pelagic_km numeric,
      ADD COLUMN bottom_km numeric,
      ADD COLUMN fishing_km numeric,
      ADD COLUMN passive_km numeric;

    -- Estimate fishing hours per km2
    UPDATE public.fishing SET pelagic_km = pelagic / (ST_Area(geom::geography) / 1000000);
    UPDATE public.fishing SET bottom_km  = bottom  / (ST_Area(geom::geography) / 1000000);
    UPDATE public.fishing SET fishing_km = fishing / (ST_Area(geom::geography) / 1000000);
    UPDATE public.fishing SET passive_km = passive / (ST_Area(geom::geography) / 1000000);

    Select only the Atlantic Ocean area: we used the boundaries of the Atlantic Ocean to keep only the data falling within them, joining both tables with the ST_Contains() function.
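
    A hedged sketch of that final selection; the boundary table name public.atlantic_ocean is an assumption (the source only states that Atlantic Ocean boundaries and ST_Contains() were used), and its name column is assumed to feed the output attribute described below:

    -- Keep only grid cells that fall within the Atlantic Ocean boundaries
    CREATE TABLE public.fishing_atlantic AS
    SELECT f.*, a.name
    FROM public.fishing f
    JOIN public.atlantic_ocean a
      ON ST_Contains(a.geom, f.geom);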

    2.3. Data Output

    The Fishing_Intensity_Mission_Atlantic table corresponds to fishing hours per square kilometre estimated by grid cell (0.1 degree) of the Atlantic Ocean in 2020, spatially identified by geometry (Spatial Reference System 4326). The associated attributes are:

    gid: grid cell identifier [data type: UUID]

    name: name of the Atlantic Ocean area [data type: character]

    pelagic_km: Pelagic fishing hours per square kilometre [data type: numeric]

    bottom_km: Seabed fishing hours per square kilometre [data type: numeric]

    fishing_km: Unknown fishing hours per square kilometre [data type: numeric]

    passive_km: Passive fishing hours per square kilometre [data type: numeric]

    geom: grid cell geometry (EPSG: 4326) [data type: geometry]

  12. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated May 19, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arson 1979-2019 [Dataset]. http://doi.org/10.3886/E103540V9
    Explore at:
    Dataset updated
    May 19, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1979 - 2019
    Area covered
    United States
    Description

    Version 9 release notes:
    Changes release notes description, does not change data.
    Version 8 release notes:
    Adds 2019 data. Note that the number of months missing variable changes sharply starting in 2018. This is probably due to changes in UCR reporting of the column_2_type variable, which is used to generate the months missing count (the code I used does not change). So pre-2018 and 2018+ years may not be comparable for this variable.
    Version 7 release notes:
    Adds a last_month_reported column which says which month was reported last. This is actually how the FBI defines number_of_months_reported, so it is a more accurate representation of that. Removes the number_of_months_reported variable as the name is misleading. You should use the last_month_reported or the number_of_months_missing (see below) variable instead.
    Adds a number_of_months_missing column in the annual data, which is the number of times the agency reports "missing" data (i.e. did not report that month) in the card_2_type variable or reports NA in that variable. Please note that this variable is not perfect: sometimes an agency does not report data but this variable does not say it is missing, so it will not be perfectly accurate.
    Version 6 release notes:
    Adds 2018 data.
    Version 5 release notes:
    Adds data in the following formats: SPSS and Excel.
    Changes project name to avoid confusing this data for the ones done by NACJD.
    Version 4 release notes:
    Adds 1979-2000, 2006, and 2017 data.
    Adds agencies that reported 0 months.
    Adds monthly data.
    All data now from FBI, not NACJD.
    Changes some column names so all columns are <=32 characters to be usable in Stata.
    Version 3 release notes:
    Adds data for 2016.
    Orders rows by year (descending) and ORI.
    Removes data from Chattahoochee Hills (ORI = "GA06059") from 2016 data. In 2016, that agency reported about 28 times as many vehicle arsons as their population (total mobile arsons = 77762, population = 2754).
    Version 2 release notes:
    Fixes bug where Philadelphia Police Department had incorrect FIPS county code.

    This Arson data set is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about arsons reported in the United States: the number of arsons reported, reported to have actually occurred, found not to have occurred ("unfounded"), cleared by the arrest of at least one offender, cleared by arrest where all offenders are under the age of 18, and the cost of the arson. This is done for a number of different arson location categories such as community building, residence, vehicle, and industrial/manufacturing structure. The yearly data sets here combine data from the years 1979-2018 into a single file for each group of crimes. Each monthly file is only a single year as my laptop can't handle combining all the years together. These files are quite large and may take some time to load. I also added state, county, and place FIPS codes from the LEAIC crosswalk.

    A small number of agencies had some months with clearly incorrect data. I changed the incorrect columns to NA and left the other columns unchanged for that agency. The following are data problems that I fixed; there are still likely issues remaining in the data, so make sure to check yourself before running analyses.
    Oneida, New York (ORI = NY03200) had multiple years that reported single arsons costing over $700 million. I deleted this agency from all years of data.
    In January 1989, Union, North Carolina (ORI = NC09000) reported 30,000 arsons in uninhabited single occupancy buildings and none in any other month.
    In December 1991, Gadsden, Florida (ORI = FL02000) reported that a single arson at a community/public building caused $99,999,999 in damages (the maximum possible).
    In April 2017, St. Paul, Minnesota (ORI = MN06209) reported 73,400 arsons in uninhabited storage buildings and 10,000 arsons in uninhabited community/public buildings, and one or fewer in every other month.
    When an arson is determined to be unfounded, the estimated damage from that arson is added as a negative value to zero out the previously reported estimated damages. This occasionally leads to some agencies having negative values for arson damages. You should be cautious when using the estimated damage columns as some values are quite large. Negative values in other columns are also due to adjustments (

  13. Data from: A Greek Parliament Proceedings Dataset for Computational...

    • data.europa.eu
    unknown
    Updated Jun 8, 2022
    Cite
    Zenodo (2022). A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7005201?locale=es
    Explore at:
    unknown (1427754875 bytes). Available download formats
    Dataset updated
    Jun 8, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Greece
    Description

    The dataset is a new version of the previous upload and includes the following files:

    1. dataset_versions/tell_all.csv: The initial dataset of 1,280,927 extracted speeches, before preprocessing and cleaning. The speeches extend chronologically from July 1989 up to July 2020 and were exported from 5,355 parliamentary sitting record files. The file has a total volume of 2.5 GB and includes the following columns:
    • member_name: the name of the individual who spoke during a sitting.
    • sitting_date: the date the sitting took place.
    • parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next, and includes multiple parliamentary sessions.
    • parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
    • parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
    • political_party: the political party of the speaker.
    • government: the government in force when the speech took place.
    • member_region: the electoral district the speaker belonged to.
    • roles: information about the parliamentary roles and/or government position of the speaker.
    • member_gender: the gender of the speaker.
    • speech: the speech that the individual gave during the parliamentary sitting.

    2. dataset_versions/tell_all_FILLED.csv: An intermediate version of the dataset that includes improvements in its consistency and completeness, with a total volume of 2.5 GB. Specifically, this file is produced by filling in the missing names of the chairmen of various parliamentary sittings in "tell_all.csv". It includes the same columns as the "tell_all.csv" file.

    3. dataset_versions/tell_all_cleaned.csv: This version of the dataset is the result of further cleaning and preprocessing and is used for our word usage change study. It consists of 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, with a total volume of 2.12 GB. It includes the same columns as the aforementioned versions. The preprocessing includes the replacement of all references to political parties with the symbol "@" followed by an abbreviation of the party name, using regular expressions that capture different grammatical cases and variations. It also includes the removal of accents, of strings with length less than 2 characters, and of all punctuation except full stops, as well as the replacement of stopwords with "@sw".

    4. wiki_data: A folder of modern Greek female and male names and surnames and their available grammatical cases, crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We produced the grammatical cases of the missing grammatical entries according to the rules of Greek grammar and saved the files in the same folder, adding the string "_populated.json" to their filenames.

    5. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender, and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to, and their gender.

    6. formatted_roles_gov_members_data.csv: As government members we refer to individuals in ministerial or other government posts, regardless of whether they were elected to the parliament. This information is available on the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the official individuals, the name of the role they were given, the date range of their service in each specific role, and their gender.

    7. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.

    8. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament,
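
    As an illustration of the preprocessing described for tell_all_cleaned.csv above, here is a hedged R sketch of the party-reference and stopword replacement. The pattern, the "@ND" abbreviation, and the three-word stopword list are invented placeholders, not the authors' actual regular expressions:

    library(stringr)

    speech <- "Η Νέα Δημοκρατία και το ΠΑΣΟΚ διαφώνησαν."

    # Party references -> "@" + abbreviation; the regex covers two grammatical cases
    speech <- str_replace_all(speech, "Νέας? Δημοκρατίας?", "@ND")

    # Stopwords -> "@sw" (placeholder stopword list)
    speech <- str_replace_all(speech, "\\b(και|το|η)\\b", "@sw")

    speech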

  14. Valorant Weapons

    • kaggle.com
    zip
    Updated May 28, 2020
    Cite
    Nicolas Fouqué (2020). Valorant Weapons [Dataset]. https://www.kaggle.com/equino/valorant-weapons-stats
    Explore at:
    zip (706 bytes). Available download formats
    Dataset updated
    May 28, 2020
    Authors
    Nicolas Fouqué
    Description

    Context

    I was going to crunch some data to see if some weapons are better than others, so I figured why not upload the data here.

    Content

    I think the column names are mostly self-explanatory. For weapons that had only two different damage ranges, I took the damage at 25m as the mid-range value.

    Acknowledgements

    Data was found here: https://blitz.gg/valorant/weapons. I take no credit and guarantee no accuracy; I just put it in a CSV.
    About the license: I don't think in-game data falls under any law; please tell me if I'm mistaken.

    Inspiration

    If you have any suggestions about a column I could add, post them here and I may add it. If patches change these values, notify me and I'll update the numbers; be sure to check the date of the last update for that matter.

  15. U.S. Electricity Prices

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Alistair King (2024). U.S. Electricity Prices [Dataset]. https://www.kaggle.com/datasets/alistairking/electricity-prices
    Explore at:
    zip (1553011 bytes). Available download formats
    Dataset updated
    Apr 7, 2024
    Authors
    Alistair King
    License

    https://www.usa.gov/government-works/

    Area covered
    United States
    Description

    US Electricity Prices and Sales by State, Sector, and Year

    This comprehensive dataset offers a detailed look at the United States electricity market, providing valuable insights into prices, sales, and revenue across various states, sectors, and years. With data spanning from 2001 to 2024, this dataset is a powerful tool for analyzing the complex dynamics of the US electricity market and understanding how it has evolved over time.

    The dataset includes eight key variables:
    • year: the year of the observation
    • month: the month of the observation
    • stateDescription: the name of the state
    • sectorName: the sector of the electricity market (residential, commercial, industrial, other, or all sectors)
    • customers: the number of customers (missing for some observations)
    • price: the average price of electricity in cents per kilowatt-hour (kWh)
    • revenue: the total revenue generated from electricity sales in millions of dollars
    • sales: the total electricity sales in millions of kilowatt-hours (kWh)

    By providing such granular data, this dataset enables users to conduct in-depth analyses of electricity market trends, comparing prices and consumption patterns across different states and sectors, and examining the impact of seasonality on demand and prices.

    One of the primary applications of this dataset is in forecasting future electricity prices and sales based on historical trends. By leveraging the extensive time series data available, researchers and analysts can develop sophisticated models to predict how prices and demand may change in the coming years, taking into account factors such as economic growth, population shifts, and policy changes. This predictive power is invaluable for policymakers, energy companies, and investors looking to make informed decisions in the rapidly evolving electricity market.

    Another key use case for this dataset is in investigating the complex relationships between electricity prices, sales volumes, and revenue. By combining the price, sales, and revenue data, users can explore how changes in prices impact consumer behavior and utility company bottom lines. This analysis can shed light on important questions such as the price elasticity of electricity demand, the effectiveness of energy efficiency programs, and the potential impact of new technologies like renewable energy and energy storage on the market.
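
    As a quick worked example of the relationship between these columns: revenue is in millions of dollars and sales in millions of kWh, so revenue divided by sales gives dollars per kWh, and multiplying by 100 should approximately recover the price column in cents per kWh. Below is a minimal sketch in R; the file name "us_electricity.csv" is a placeholder for whatever the downloaded CSV is called.

    # Check that price (cents/kWh) is roughly 100 * revenue ($ millions) / sales (million kWh)
    prices <- read.csv("us_electricity.csv", stringsAsFactors = FALSE)  # placeholder file name
    prices$implied_price <- 100 * prices$revenue / prices$sales
    summary(prices$price - prices$implied_price)  # differences should be near zero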

    Beyond its immediate applications in the energy sector, this dataset also has broader implications for understanding the US economy and society as a whole. Electricity is a critical input for businesses and households across the country, and changes in electricity prices and consumption can have far-reaching effects on economic growth, competitiveness, and quality of life. By providing such a rich and detailed portrait of the US electricity market, this dataset opens up new avenues for research and insights that can inform public policy, business strategy, and academic inquiry.

    I hope you all enjoy using this dataset and find it useful! 🤗

  16. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated Mar 29, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1974-2019 [Dataset]. http://doi.org/10.3886/E102263V12
    Explore at:
    Dataset updated
    Mar 29, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1974 - 2019
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com.

    Version 12 release notes:
    Adds 2019 data.
    Version 11 release notes:
    Changes release notes description, does not change data.
    Version 10 release notes:
    The data now has the following age categories (which were previously aggregated into larger groups to reduce file size): under 10, 10-12, 13-14, 40-44, 45-49, 50-54, 55-59, 60-64, over 64. These categories are available for female, male, and total (female+male) arrests. The previous aggregated categories (under 15, 40-49, and over 49) have been removed from the data.
    Version 9 release notes:
    For each offense, adds a variable indicating the number of months that offense was reported - these variables are labeled "num_months_[crime]" where [crime] is the offense name. These variables are generated from the number of times one or more arrests were reported per month for that crime. For example, if there was at least one arrest for assault in January, February, March, and August (and no other months), there would be four months reported for assault. Please note that this does not differentiate between an agency not reporting that month and actually having zero arrests. The variable "number_of_months_reported" is still in the data and is the number of months that any offense was reported. So if an agency reports murder arrests every month but no other crimes, the murder number-of-months variable and the "number_of_months_reported" variable will both be 12 while every other offense's number-of-months variable will be 0.
    Adds data for 2017 and 2018.
    Version 8 release notes:
    Adds annual data in R format.
    Changes project name to avoid confusing this data for the ones done by NACJD.
    Fixes bug where bookmaking was excluded as an arrest category.
    Changed the number of categories to include more offenses per category to have fewer total files. Added a "total_race" file for each category - this file has total arrests by race for each crime and a breakdown of juvenile/adult by race.
    Version 7 release notes:
    Adds 1974-1979 data.
    Adds monthly data (only totals by sex and race, not by age categories).
    All data now from FBI, not NACJD.
    Changes some column names so all columns are <=32 characters to be usable in Stata.
    Changes how the number of months reported is calculated. Now it is the number of unique months with arrest data reported - months of data from the monthly header file (i.e. juvenile disposition data) are not considered in this calculation.
    Version 6 release notes:
    Fixes bug where juvenile female columns had the same value as juvenile male columns.
    Version 5 release notes:
    Removes support for SPSS and Excel data.
    Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    Adds in agencies that report 0 months of the year.
    Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    Removes data on runaways.
    Version 4 release notes:
    Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    Adds data for 2016.
    Orders rows by year (descending) and ORI.
    Version 2 release notes:
    Fixes bug where Philadelphia Police Department had incorrect FIPS county code.

    The Arrests by Age, Sex, and Race (ASR) data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1974-2019 into a single file for each group of crimes. Each monthly file is only a single year as my laptop can't handle combining all the years together. These files are quite large and may take some time to load. Col

  17. Mexico COVID-19 clinical data

    • kaggle.com
    zip
    Updated Jun 5, 2020
    Cite
    Mariana R Franklin (2020). Mexico COVID-19 clinical data [Dataset]. https://www.kaggle.com/datasets/marianarfranklin/mexico-covid19-clinical-data/code
    Explore at:
    zip (6399963 bytes). Available download formats
    Dataset updated
    Jun 5, 2020
    Authors
    Mariana R Franklin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Mexico
    Description

    Mexico COVID-19 clinical data 🦠🇲🇽

    This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127).

    The official, raw dataset is available on the Official Secretary of Epidemiology website: https://www.gob.mx/salud/documentos/datos-abiertos-152127.

    You might also want to download the official column descriptors and the variable definitions (e.g. SEXO=1 -> Female; SEXO=2 -> Male; SEXO=99 -> Undisclosed) in the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've maintained the original levels as described in the official dataset, unless otherwise specified.

    IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but rather consisted of PDF files, often with many inconsistencies that had to be resolved carefully; those fixes are annotated in the .R script. Later datasets should be more reliable, but early on there were a lot of things to figure out (e.g. when the official methodology to assign the region of a case was changed to be based on residence rather than origin). I've added more notes on the very early data here: https://github.com/marianarf/covid19_mexico_data.

    [More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).

    Motivation

    I hope that this data serves as a base to understand the clinical symptoms 🔬 that distinguish a COVID-19 positive case from other viral respiratory diseases and helps expand the knowledge about COVID-19 worldwide.

    👩‍🔬🧑‍🔬🧪

    With more models tested, added features, and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result, in two scenarios:

    • As lab results are processed, there is a window during which it is uncertain whether a result will return positive or negative (this is merely didactic, as new reports will corroborate the prediction as soon as the laboratory data for missing cases is reported).
    • More importantly, it could help predict outcomes from similar symptom data, e.g. from a survey or an app that collects comparable fields (ideally restricted to parameters that can be assessed without hospitalization, such as the age of the person, which is readily available).

    The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded as 1 = POSITIVE and 2 = NEGATIVE.
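
    A small hedged sketch of recoding that variable in R, using the CSV name from the GitHub repository referenced below:

    covid <- read.csv("mexico_covid19.csv", stringsAsFactors = FALSE)

    # RESULTADO: 1 = POSITIVE, 2 = NEGATIVE (per the official encoding above);
    # any other value (e.g. pending results) becomes NA here
    covid$RESULTADO <- factor(covid$RESULTADO, levels = c(1, 2),
                              labels = c("POSITIVE", "NEGATIVE"))
    table(covid$RESULTADO)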

    Source

    The data was gathered using a "sentinel model" that samples 10% of the patients that present a viral respiratory diagnosis to test for COVID-19, and consists of data reported by 475 viral respiratory disease monitoring units (hospitals) named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral) throughout the country in the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).

    Preprocess

    Data is first processed with this [.R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data will be updated daily. Important: Since the data is updated on GitHub, assume the data uploaded here isn't the latest version and instead load the data directly from the csv [in this github repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).

    • The data aggregates official daily reports of patients admitted in COVID-19 designated units.
    • New cases are usually concatenated at the end of the file, but each individual case also contains a unique (official) identifier 'ID_REGISTRO' as well as a (new) unique reference 'id' to remove duplicates.
    • I fixed a specific change in methodology in reporting, where the patient record used to be assigned in ENTIDAD_UM (the region of the medical unit) but now uses ENTIDAD_RES (the region of residence of the patient).
    Note: I have preserved the original structure (column names and factors) as closely as possible to the official data, so that code is reproducible in cross-reference to the official sources.

    Added features

    In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from previous days, this allows keeping track of that lag).

    Additional info

    ...


Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016


To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The LEAIC data adds FIPS codes (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros; if you open the file in Excel, it will automatically delete those leading zeros.

I created 9 arrest categories myself. The categories are:
• Total Male Juvenile
• Total Female Juvenile
• Total Male Adult
• Total Female Adult
• Total Male
• Total Female
• Total Juvenile
• Total Adult
• Total Arrests

All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
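
A hedged sketch of how such totals can be reconstructed in R from the sex-age columns; the file name and column prefixes below are placeholders, since the real column names are abbreviated in the data:

library(dplyr)

# Placeholder file and column prefixes (e.g. columns like male_10, female_22)
asr <- read.csv("ucr_arrests_yearly_index_crimes.csv", stringsAsFactors = FALSE)

asr <- asr %>%
  mutate(total_male    = rowSums(across(starts_with("male_")), na.rm = TRUE),
         total_female  = rowSums(across(starts_with("female_")), na.rm = TRUE),
         total_arrests = total_male + total_female)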

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight that contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes
• Murder
• Rape
• Robbery
• Aggravated Assault
• Burglary
• Theft
• Motor Vehicle Theft
• Arson

Alcohol Crimes
• DUI
• Drunkenness
• Liquor

Drug Crimes
• Total Drug
• Total Drug Sales
• Total Drug Possession
• Cannabis Possession
• Cannabis Sales
• Heroin or Cocaine Possession
• Heroin or Cocaine Sales
• Other Drug Possession
• Other Drug Sales
• Synthetic Narcotic Possession
• Synthetic Narcotic Sales

Grey Collar and Property Crimes
• Forgery
• Fraud
• Stolen Property
• Financial Crimes
• Embezzlement
• Total Gambling
• Other Gambling
• Bookmaking
• Numbers Lottery

Sex or Family Crimes
• Offenses Against the Family and Children
• Other Sex Offenses
• Prostitution
• Rape

Violent Crimes
• Aggravated Assault
• Murder
• Negligent Manslaughter
• Robbery
• Weapon Offenses

Other Crimes
• Curfew
• Disorderly Conduct
• Other Non-traffic
• Suspicion
• Vandalism
• Vagrancy

Simple
This data set has every crime and only the arrest categories that I created (see above).

If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
