16 datasets found
  1. Data from: KORUS-AQ Aircraft Merge Data Files

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Jul 3, 2025
    Cite
    NASA/LARC/SD/ASDC (2025). KORUS-AQ Aircraft Merge Data Files [Dataset]. https://catalog.data.gov/dataset/korus-aq-aircraft-merge-data-files
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  2. KORUS-AQ Aircraft Merge Data Files | gimi9.com

    • gimi9.com
    Updated Sep 30, 2018
    Cite
    (2018). KORUS-AQ Aircraft Merge Data Files | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_korus-aq-aircraft-merge-data-files/
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  3. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Available download formats: zip
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication: Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.), Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository: This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to), and (iv) will (frequency of the collocates with will); it is available in input_data_raw.txt. Then the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
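
    To make the reshaping concrete, here is a minimal R sketch of the kind of long-format merge the first script performs, assuming each input file is tab-separated with a coll (collocate) column plus one frequency column per decade; the real layout should be checked against 1-script-create-input-data-raw.r:

    library(tidyverse)

    # Assumed layout: one "coll" column plus one frequency column per decade.
    will  <- read_tsv("will_INF.txt")
    going <- read_tsv("go_INF.txt")

    input_data_raw <- full_join(
      pivot_longer(will,  -coll, names_to = "decade", values_to = "will"),
      pivot_longer(going, -coll, names_to = "decade", values_to = "BE going to"),
      by = c("coll", "decade")
    ) %>%
      select(decade, coll, `BE going to`, will)

    write_tsv(input_data_raw, "input_data_raw.txt")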

  4. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label                  Data type  Description
    isogramy               int        The order of isogramy, e.g. "2" is a second-order isogram
    length                 int        The length of the word in letters
    word                   text       The actual word/isogram in ASCII
    source_pos             text       The Part of Speech tag from the original corpus
    count                  int        Token count (total number of occurrences)
    vol_count              int        Volume count (number of different sources which contain the word)
    count_per_million      int        Token count per million words
    vol_count_as_percent   int        Volume count as percentage of the total number of volumes
    is_palindrome          bool       Whether the word is a palindrome (1) or not (0)
    is_tautonym            bool       Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label               Data type  Description
    !total_1grams       int        The total number of words in the corpus
    !total_volumes      int        The total number of volumes (individual sources) in the corpus
    !total_isograms     int        The total number of isograms found in the corpus (before compacting)
    !total_palindromes  int        How many of the isograms found are palindromes
    !total_tautonyms    int        How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
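
    As a starting point for such queries, here is a minimal R sketch using RSQLite, assuming the database file is named isograms.db as above; the table name ngrams_isograms is hypothetical (inspect dbListTables for the real names), while the columns follow the layout documented earlier:

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "isograms.db")
    dbListTables(con)  # check the real table names first

    # "ngrams_isograms" is a hypothetical table name; the columns are documented above.
    top_palindromes <- dbGetQuery(con, "
        SELECT word, length, count_per_million
        FROM ngrams_isograms
        WHERE is_palindrome = 1
        ORDER BY count_per_million DESC
        LIMIT 20")

    dbDisconnect(con)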

  5. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Feb 22, 2025
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset include:
    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest-point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations of this dataset:
    - Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter by a desired percentage of cloud coverage (see the sketch after this description). Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10), and filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    Contents of the data release:
    - "year_byscene=XXXX.zip" includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data were extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, compile .parquet files in nested directories using the R arrow package, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" is a cross-walk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" is a cross-walk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" is a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
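
    A minimal R sketch of the cloud filter suggested above, assuming year_byscene=2023.zip has been unzipped in the working directory so the partitioned layout shown in the example path is on disk; the 10% threshold is an arbitrary illustration:

    library(arrow)
    library(dplyr)

    # Open every part-0.parquet under the hive-partitioned directory at once.
    lakes <- open_dataset("year_byscene=2023")

    clear_scenes <- lakes %>%
      mutate(percent_cloud_pixels =
               wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
      filter(percent_cloud_pixels < 0.10,   # example threshold, not from the release
             wb_dswe1_pixels >= 10) %>%     # drop lakes with few water pixels
      collect()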

  6. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016

    • search.datacite.org
    • openicpsr.org
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    - Removes support for SPSS and Excel data.
    - Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    - Adds in agencies that report 0 months of the year.
    - Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    - Removes data on runaways.

    Version 4 release notes:
    - Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.

    Version 3 release notes:
    - Adds data for 2016.
    - Orders rows by year (descending) and ORI.

    Version 2 release notes:
    - Fixes bug where Philadelphia Police Department had an incorrect FIPS county code.

    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2016 into a single file. These files are quite large and may take some time to load.

    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported"; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are: Total Male Juvenile, Total Female Juvenile, Total Male Adult, Total Female Adult, Total Male, Total Female, Total Juvenile, Total Adult, and Total Arrests. All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
    Alcohol Crimes: DUI, Drunkenness, Liquor
    Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
    Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
    Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
    Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
    Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
    Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
    Simple: This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
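
    A minimal R sketch of the two cleaning rules described above ("None/not reported" becomes zero, known error values become NA); the single arrests column is an illustrative stand-in for the real arrest columns:

    library(dplyr)

    raw <- data.frame(arrests = c("3", "None/not reported", "99999"))

    error_values <- c(10000, 20000, 30000, 40000, 50000, 60000,
                      70000, 80000, 90000, 100000, 99999, 99998)

    clean <- raw %>%
      mutate(arrests = ifelse(arrests == "None/not reported", "0", arrests),
             arrests = as.numeric(arrests),
             arrests = ifelse(arrests %in% error_values, NA, arrests))
    # clean$arrests is now 3, 0, NA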

  7. CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    • +2 more
    Updated Dec 13, 2019
    Cite
    Australia (2019). CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014 [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/4b0e74ed-2fad-4608-a743-92163e13c30d
    Dataset provided by
    Australia
    Area covered
    New South Wales
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.

    The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column has been added, 'Asset Class' that aggregates the purpose of the licence into the set classes for the Asset Database. Also the 'Completed_Depth' has been added, which is the total depth of the groundwater bore. These columns were added for the purpose of the Asset Register.

    The aim of this dataset was to be able to map each groundwater works with the volumetric entitlement without double counting the volume, and to aggregate/disaggregate the data depending on the final use.

    This has not been clipped to the CLM PAE; therefore, the number of economic assets/relevant licences will drastically reduce once this occurs.

    The Clarence Moreton groundwater licences includes an extract of all licences that fell within the data management acquisition area as provided by BA to NSW Office of Water.

    Aim: To get a one-to-one ratio of licence numbers to bore IDs.

    Important notes about data:

    Data has not been clipped to the PAE.

    No decisions have been made regarding which purposes of groundwater use should be protected. Therefore the purpose currently includes groundwater bores that have been drilled for non-extractive purposes, including experimental research, test, monitoring bore, teaching, mineral exploration and groundwater exploration.

    No volume has been included for domestic & stock as it is a basic right. Therefore an arbitrary volume could be applied to account for D&S use.

    Licence Number - Each sheet in the Original Data has a licence number; this is assumed to be the actual licence number. Some are old because they have not been updated to the new WA. Some are new (From_Spreadsheet_WALs). This is the reason for the different codes.

    WA/CA - This number is the 'works' number. It is assumed that the number indicates the bore permit or works approval. This is why there can be multiple works per licence and multiple licences per works number. (For a complete glossary see http://registers.water.nsw.gov.au/wma/Glossary.jsp.) Originally, the aim was to make sure that when there was more than one licence per works number, or multiple works per licence, the multiple instances were complete.

    Clarence Moreton worksheet links the individual licence to a works and a volumetric entitlement. For most sites, this can be linked to a bore which can be found in the NGIS through the HydroID. (\wron\Project\BA\BA_all\Hydrogeology_National_Groundwater_Information_System_v1.1_Sept2013). This will allow analysis of depths, lithology and hydrostratigraphy where the data exists.

    We can aggregate the data based on water source and water management zone as can be seen in the other worksheets.

    Data available:

    Original Data: Any data that was brought in from NSW Office of Water, which includes:

    Spatial locations provided by NoW - This is exported data from the submitted shape files. It includes the licence (LICENCE) numbers and the bore ID (WORK_NUO). (Refer to lineage: NSW Office of Water Groundwater Entitlements Spatial Locations.)

    Spreadsheet_WAL - The spreadsheet from the submitted data, WLS-EXTRACT_WALs_volume. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)

    WLS_extracts - The combined spreadsheets from the submitted data, WLS-EXTRACT. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)

    Aggregated share component to water sharing plan, water source and water management zone

    Dataset History

    The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column has been added, 'Asset Class' that aggregates the purpose of the licence into the set classes for the Asset Database.

    Where purpose = domestic, or domestic & stock, or stock, it was classed as a 'basic water right'. Where it is listed as both a domestic/stock and a licensed use such as irrigation, it was classed as a 'water access right'. All other take and use were classed as a 'water access right'. Where purpose = drainage, waste disposal, groundwater remediation, experimental research, null, conveyancing, or test bore, these were not given an asset class. Monitoring bores were classed as 'Water supply and monitoring infrastructure'.

    Depth has also been included which is the completed depth of the bore.

    Instructions

    Procedure: refer to Bioregional assessment data conversion script.docx

    1) The original spreadsheets have multiple licence instances if there is more than one WA/CA number. This means that there is more than one works or permit per licence. The aim is to only have one licence instance.

    2) The individual licence numbers were combined into one column.

    3) Using the new column of licence numbers, several vlookups were created to bring in other data. Where the columns are identical in the original spreadsheets, they are combined. The only ones that are not are Share/Entitlement/Allocation, as these mean different things.

    4) A HydroID column was created; this is a code that links this NSW data to the NGIS, which is basically a ".1.1" at the end of the bore code.

    5) All 'cancelled' licences were removed.

    6) A count of the number of works per licence and the number of bores were included in the spreadsheet.

    7) Where the ShareComponent = NA, the Entitlement = 0, the Allocation = 0 and there was more than one instance of the same bore, this means that the original licence assigned to the bore has been replaced by a new licence with a share component. Where these criteria were met, the instances were removed.

    8) A volume per works ensures that the volume of the licence is not repeated for each works, but is divided by the number of works.

    Bioregional assessment data conversion script

    Aim: The following document is the R-Studio script for the conversion and merging of the bioregional assessment data.

    Requirements: The user will need R-Studio. Some basic knowledge of R is recommended; without it, the only things that would really need to be changed are the file locations and names. The way that R reads file paths is different to Windows, and the locations that R-Studio reads from depend on where R-Studio was originally installed to point. This would need to be set up properly before the script can be run.

    Procedure: The information below the dashed line is the script. This can be copied and pasted directly into R-Studio. Any text with '#' will not be read as a script, so that can be added in and read as an instruction.

    ###########
    # 18/2/2014
    # Code by Brendan Dimech
    #
    # Script to merge extract files from submitted NSW bioregional
    # assessment and convert data into required format. Also use a 'vlookup'
    # process to get Bore and Location information from NGIS.
    #
    # There are 3 scripts, one for each of the individual regions.
    ############

    # CLARENCE MORETON

    # Opening of files. Location can be changed if needed.
    # arc.file is the exported *.csv from the NGIS data which has bore data and lat/long information.
    # Lat/long weren't in the file natively so were added to the table using Arc Toolbox tools.
    arc.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
    arc.file = "Moreton.csv"

    # Files from NSW came through in two types. WALS files included 'newer' licences that had a
    # share component. The 'OTH' files were older licences that had just an allocation. Some data
    # was similar and this was combined, and other information that wasn't similar between the
    # datasets was removed.

    # This section locates and imports the WALS and OTH files.
    WALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
    WALS.file = "GW_Clarence_Moreton_WLS-EXTRACT_4_WALs_volume.xls"
    OTH.file.1 = "GW_Clarence_Moreton_WLS-EXTRACT_1.xls"
    OTH.file.2 = "GW_Clarence_Moreton_WLS-EXTRACT_2.xls"
    OTH.file.3 = "GW_Clarence_Moreton_WLS-EXTRACT_3.xls"
    OTH.file.4 = "GW_Clarence_Moreton_WLS-EXTRACT_4.xls"
    newWALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Products'
    newWALS.file = "Clarence_Moreton.csv"

    arc <- read.csv(paste(arc.folder, arc.file, sep = "/"), header = TRUE, sep = ",")
    WALS <- read.table(paste(WALS.folder, WALS.file, sep = "/"), header = TRUE, sep = "\t")

    # Merge any individual WALS and OTH files into a single WALS or OTH file if there were more than one.
    OTH1 <- read.table(paste(WALS.folder, OTH.file.1, sep = "/"), header = TRUE, sep = "\t")
    OTH2 <- read.table(paste(WALS.folder, OTH.file.2, sep = "/"), header = TRUE, sep = "\t")
    OTH3 <- read.table(paste(WALS.folder, OTH.file.3, sep = "/"), header = TRUE, sep = "\t")
    OTH4 <- read.table(paste(WALS.folder, OTH.file.4, sep = "/"), header = TRUE, sep = "\t")

    OTH <- merge(OTH1, OTH2, all.y = TRUE, all.x = TRUE)
    OTH <- merge(OTH, OTH3, all.y = TRUE, all.x = TRUE)
    OTH <- merge(OTH, OTH4, all.y = TRUE, all.x = TRUE)

    # Add new columns to OTH for the BORE, LAT and LONG. Then use 'match' as a vlookup to add the
    # corresponding bore and location from the arc file. The WALS and OTH files are slightly
    # different because the arc file has a different licence number added in.
    OTH <- data.frame(OTH, BORE = "", LAT = "", LONG = "")
    OTH$BORE <- arc$WORK_NO[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
    OTH$LAT <- arc$POINT_X[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
    OTH$LONG <- arc$POINT_Y[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]  # assumed completion: the extract is cut off here; POINT_Y mirrors the LAT lookup above

  8. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes:
    - Adds data in the following formats: Excel.
    - Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes:
    - Adds data for 2017.
    - Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be (see the sketch after this description). These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
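
    As an illustration of the negative-indicator replacement described above, a minimal R sketch; only "j" = -1 is named in this description, so the mapping is deliberately limited to that documented value and would need to be extended from the FBI codebook:

    # Decode letter codes to the negative numbers they stand for.
    decode_negatives <- function(x) {
      neg_codes <- c(j = -1)  # documented example; extend per the FBI codebook
      ifelse(x %in% names(neg_codes),
             neg_codes[x],
             suppressWarnings(as.numeric(x)))
    }

    decode_negatives(c("5", "j", "12"))  # returns 5, -1, 12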

  9. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2022

    • openicpsr.org
    Updated May 18, 2018
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2022 [Dataset]. http://doi.org/10.3886/E103500V10
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1991 - 2021
    Area covered
    United States
    Description

    !!!WARNING!!! This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. !!!WARNING!!!

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com.

    Version 10 release notes:
    - Adds 2022 data.

    Version 9 release notes:
    - Adds 2021 data.

    Version 8 release notes:
    - Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last UCR hate crime data they release.
    - Changes .rda file to .rds.

    Version 7 release notes:
    - Changes release notes description; does not change data.

    Version 6 release notes:
    - Adds 2018 data.

    Version 5 release notes:
    - Adds data in the following formats: SPSS, SAS, and Excel.
    - Changes project name to avoid confusing this data for the ones done by NACJD.
    - Adds data for 1991.
    - Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label.
    - All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.

    Version 4 release notes:
    - Adds data for 2017.
    - Adds rows that submitted a zero-report (i.e. the agency reported no hate crimes in the year). This is for all years 1992-2017.
    - Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent.
    - Made the 'population' column, which is the total population in that agency.

    Version 3 release notes:
    - Adds data for 2016.
    - Orders rows by year (descending) and ORI.

    Version 2 release notes:
    - Fixes bug where Philadelphia Police Department had an incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns together (see the sketch after this description). Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.).

    The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in a Stata format), made all character values lower case, and reordered columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
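
    A minimal R sketch of that unique ID construction; the column names and values are illustrative stand-ins:

    # Combine year, ORI9, and incident number into one identifier.
    hate <- data.frame(year = 1999,
                       ori9 = "nj0010100",          # illustrative ORI9
                       incident_number = "004361")  # illustrative incident number
    hate$unique_id <- paste(hate$year, hate$ori9, hate$incident_number, sep = "_")
    # -> "1999_nj0010100_004361"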

  10. Uniform Crime Reporting Program Data: Offenses Known and Clearances by Arrest, 1960-2016

    • datasearch.gesis.org
    Updated Jun 12, 2018
    Cite
    Kaplan, Jacob (2018). Uniform Crime Reporting Program Data: Offenses Known and Clearances by Arrest, 1960-2016 [Dataset]. http://doi.org/10.3886/E100707V3-5862
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    This version (V3) fixes a bug in Version 2 where 1993 data did not properly deal with missing values, leading to enormous counts of crime being reported.

    This is a collection of Offenses Known and Clearances By Arrest data from 1960 to 2016. The monthly zip files contain one data file per year (57 total, 1960-2016) as well as a codebook for each year. These files have been read into R using the ASCII and setup files from ICPSR (or from the FBI for 2016 data) using the package asciiSetupReader. The end of the zip folder's name says what data type (R, SPSS, SAS, Microsoft Excel CSV, feather, Stata) the data is in. Due to file size limits on openICPSR, not all file types were included for all the data.

    The files are lightly cleaned. What this means specifically is that column names and value labels are standardized. In the original data, column names were different between years (e.g. the December burglaries cleared column is "DEC_TOT_CLR_BRGLRY_TOT" in 1975 and "DEC_TOT_CLR_BURG_TOTAL" in 1977). The data here have standardized columns so you can compare between years and combine years together (see the sketch after this description). The same thing is done for values inside of columns. For example, the state column gave state names in some years and abbreviations in others. For the code used to clean and read the data, please see my GitHub file here: https://github.com/jacobkap/crime_data/blob/master/R_code/offenses_known.R

    The zip files labeled "yearly" contain yearly data rather than monthly. These also contain far fewer descriptive columns about the agencies in an attempt to decrease file size. Each zip folder contains two files: a data file in whatever format you choose and a codebook. The data file is aggregated yearly and has already combined every year 1960-2016. For the code I used to do this, see here: https://github.com/jacobkap/crime_data/blob/master/R_code/yearly_offenses_known.R

    If you find any mistakes in the data or have any suggestions, please email me at jkkaplan6@gmail.com.

    As a description of what UCR Offenses Known and Clearances By Arrest data contains, the following is copied from ICPSR's 2015 page for the data: The Uniform Crime Reporting Program Data: Offenses Known and Clearances By Arrest dataset is a compilation of offenses reported to law enforcement agencies in the United States. Due to the vast number of categories of crime committed in the United States, the FBI has limited the type of crimes included in this compilation to those crimes which people are most likely to report to police and those crimes which occur frequently enough to be analyzed across time. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. Much information about these crimes is provided in this dataset. The number of times an offense has been reported, the number of reported offenses that have been cleared by arrests, and the number of cleared offenses which involved offenders under the age of 18 are the major items of information collected.
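
    A minimal R sketch of the column standardization described above, using the two example names from the description; the standardized target name is an illustrative choice, not necessarily the one used in the release:

    # Map each year-specific variant to one standard name (target name is illustrative).
    rename_map <- c(DEC_TOT_CLR_BRGLRY_TOT = "dec_tot_clr_burglary_total",
                    DEC_TOT_CLR_BURG_TOTAL = "dec_tot_clr_burglary_total")

    standardize_names <- function(df) {
      hits <- names(df) %in% names(rename_map)
      names(df)[hits] <- rename_map[names(df)[hits]]
      df
    }

    df1975 <- data.frame(DEC_TOT_CLR_BRGLRY_TOT = 4)
    names(standardize_names(df1975))  # "dec_tot_clr_burglary_total"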

  11. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 4 release notes:
    - Adds data for 2018.

    Version 3 release notes:
    - Adds data in the following formats: Excel.
    - Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes:
    - Adds data for 2017.
    - Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value" (see the sketch after this description).

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
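
    A minimal R sketch of restricting the error-value replacement to the "offenses*" and "auto*" column groups while leaving the "value*" columns untouched; the column names are illustrative:

    library(dplyr)

    error_values <- c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000,
                      10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000,
                      90000, 100000, 99942)

    prop <- data.frame(offenses_theft = c(12, 99942),
                       auto_theft     = c(3, 20000),
                       value_theft    = c(1000, 250))  # "value" columns are left alone

    prop <- prop %>%
      mutate(across(c(starts_with("offenses"), starts_with("auto")),
                    ~ replace(.x, .x %in% error_values, NA)))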

  12. Video game pricing analytics dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivi Deveshwar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for 3 video games (Call of Duty: Black Ops 3, Persona 5 Royal, and Counter-Strike: Global Offensive) was taken through a web scrape of SteamDB (https://steamdb.info/), a large repository for game-related data such as release dates, reviews, prices, and more. In the initial scrape, each individual game has two files: customer reviews (count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R. The customer reviews dataset contains the date that the review was posted and the review text, while the price dataset contains the date that the price was changed and the price on that date. To clean and prepare the data, we first start by sectioning the data in Excel. After scraping, our CSV file fits each review in one row with the date. We split the data, separating date and review so that each has its own column. Scraping the price conveniently separated price and date already, so after the split we just made sure that every file had similar column names.

    The cleaning is finished in R. Each game has separate files for prices and reviews, so each price series is first converted into a continuous time series by carrying the most recent price forward to every date. The price dataset is then combined with its respective review dataset on the common date column using a left join. The resulting dataset for each game contains four columns - game name, date, review, and price. From there, the user can select the game they would like to view.
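    The forward-fill and join step could look roughly like the following R sketch; the file names, column names, and game label are assumptions for illustration.

        # A minimal sketch of the price forward-fill and review/price join,
        # assuming hypothetical file and column names.
        library(dplyr)
        library(tidyr)

        reviews <- read.csv("persona5_reviews.csv")  # assumed columns: date, review
        prices  <- read.csv("persona5_prices.csv")   # assumed columns: date, price
        reviews$date <- as.Date(reviews$date)
        prices$date  <- as.Date(prices$date)

        # Extend prices to a continuous daily series by carrying the most
        # recent known price forward to each date
        all_dates <- data.frame(date = seq(min(prices$date), max(prices$date), by = "day"))
        price_series <- all_dates %>%
          left_join(prices, by = "date") %>%
          fill(price, .direction = "down")

        # Combine the game's reviews with its price series on the date column
        merged <- reviews %>%
          left_join(price_series, by = "date") %>%
          mutate(game = "Persona 5 Royal") %>%
          select(game, date, review, price)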

  13. Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2017

    • openicpsr.org
    Updated May 18, 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2017 [Dataset]. http://doi.org/10.3886/E103500V4
    Explore at:
    Dataset updated
    May 18, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1992 - 2017
    Area covered
    United States
    Description

    Version 4 release notes: Adds data for 2017. Adds rows for agencies that submitted a zero-report (i.e. the agency reported no hate crimes in that year); this applies to all years 1992-2017. Makes categorical variables (e.g. the bias motivation columns) consistent over time; different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent. Adds the 'population' column, which is the total population in that agency.
    Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes: Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about hate crimes reported in the United States. The data sets here combine all data from the years 1992-2017 into a single file. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident reported by an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. The data also include the total number of victims, the total number of offenders, and the race of the offenders (as a group). Finally, there are a number of columns indicating whether the victim of each offense was a certain type of victim (e.g. individual victim, business victim, religious victim, etc.).

    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), changed the names of some UCR offense codes (e.g. from "agg asslt" to "aggravated assault"), made all character values lower case, and reordered columns. I also added state, county, and place FIPS codes from the LEAIC (crosswalk) and generated incident month, weekday, and month-day variables from the incident date variable included in the original data. The zip file contains the data in the following formats, plus a codebook: .dta (Stata) and .rda (R). If you have any questions, comments, or suggestions, please contact me at jkkaplan6@gmail.com.
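    The read-in and unique ID construction could look roughly like the following R sketch; the file names and column names are assumptions for illustration.

        # A minimal sketch of reading the fixed-width data with asciiSetupReader
        # and building the "unique_id" column. File and column names are
        # hypothetical.
        library(asciiSetupReader)

        hate <- read_ascii_setup("hate_crime_data.txt", "hate_crime_setup.sps")

        # Combine year, agency ORI9, and incident number into one incident ID
        hate$unique_id <- paste(hate$year, hate$ori9, hate$incident_number, sep = "_")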

  14. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2017

    • openicpsr.org
    • search.datacite.org
    Updated May 18, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2017 [Dataset]. http://doi.org/10.3886/E103500V5
    Explore at:
    Dataset updated
    May 18, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1991 - 2017
    Area covered
    United States
    Description

    For any questions about this data, please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 5 release notes: Adds data in the following formats: SPSS, SAS, and Excel. Changes the project name to avoid confusing this data with the ones done by NACJD. Adds data for 1991. Fixes a bug where the bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, which caused there to be two columns, with zero values in the years with the wrong label. All data now comes directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data.
    Version 4 release notes: Adds data for 2017. Adds rows for agencies that submitted a zero-report (i.e. the agency reported no hate crimes in that year); this applies to all years 1992-2017. Makes categorical variables (e.g. the bias motivation columns) consistent over time; different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent. Adds the 'population' column, which is the total population in that agency.
    Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes: Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident reported by an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. The data also include the total number of victims, the total number of offenders, and the race of the offenders (as a group). Finally, there are a number of columns indicating whether the victim of each offense was a certain type of victim (e.g. individual victim, business victim, religious victim, etc.). The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), changed the names of some UCR offense codes (e.g. from "agg asslt" to "aggravated assault"), made all character values lower case, and reordered columns. I also added state, county, and place FIPS codes from the LEAIC (crosswalk) and generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
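    The label-consistency fix could look roughly like the following R sketch; the column name and the exact label strings to recode are assumptions based on the examples in the release notes.

        # A minimal sketch of recoding inconsistent category labels so the same
        # category has one name across all years. Column and label strings are
        # illustrative, not the author's actual recoding table.
        library(dplyr)

        hate <- hate %>%
          mutate(bias_motivation = recode(bias_motivation,
            "anti-am indian" = "anti-american indian",
            "anti-homosexual (gay and lesbian)" =
              "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)"
          ))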

  15. Data from: Open-data release of aggregated Australian school-level information. Edition 2016.1

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Monteiro Lobato, (2020). Open-data release of aggregated Australian school-level information. Edition 2016.1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_46086
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Monteiro Lobato,
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available, but it has not previously been released as a documented, useful aggregation. The information includes: (a) the names of schools; (b) staffing levels, including full-time and part-time teaching and non-teaching staff; (c) student enrolments, including the number of boys and girls; (d) school financial information, including Commonwealth government, state government, and private funding; (e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme known by the trademark 'NAPLAN'.

    Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.

    The database was constructed directly from the other included files by running the following command at a command-line prompt:

        sqlite3 school-data-2016-1.db < school-data-2016-1.sql

    Note that a few non-consequential errors will be reported if you run this command yourself. The reason for the errors is that the SQLite database is created by importing a series of '.csv' files, each of which contains a header line with the names of the variables relevant to each column. That header information is useful for many statistical packages, but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.

    Briefly, the data are organized as follows.
    (1) The .csv files ('comma separated values') do not actually use a comma as the field delimiter; instead, the vertical bar character '|' (ASCII octal 174, decimal 124, hex 7C) is used. If you read the .csv files using Microsoft Excel, OpenOffice, or LibreOffice, you will need to set the field separator to '|'. Check your software documentation to understand how to do this.
    (2) Each school-related record is indexed by an identifier called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOIN-ing records in different data files. For example, the first school-related record after the header line in the file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the ageid 40000 corresponds to a school called 'Corpus Christi Catholic School'.
    (3) In addition to the variable 'ageid', each record is also identified by one or two 'year' variables. The most important purpose of a year identifier is to indicate the year that is relevant to the record. For example, turning again to the file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid 40000. The variable that identifies the important differences between these seven records is 'studentyear', which shows the year to which the student data refer. One can see, for example, that in 2008 there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line).
    (4) The variables relating to years are given different names in each of the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information across files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might JOIN records using 'ageid' in the two files and, secondarily, match 'studentsyear' with 'financesummaryyear'. A sketch of this two-level join follows this list.
    (5) The manipulation of the data is most readily done using the SQL language with the SQLite database, but it can also be done in a variety of statistical packages.
    (6) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form that the user most wants.
    (7) Geocoding of the schools is not available in this edition.
    (8) Some files, such as 'sector-headed-bar.csv', are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy.
    (9) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org), can be seen in the file 'school-data-with-sqlite.R'.
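    The two-level join in point (4) could look roughly like the following R sketch, run against the included SQLite database. The table names 'students' and 'financesummary' are assumptions for illustration; the key columns ('ageid', 'studentsyear', 'financesummaryyear') come from the description above. Consult 'school-data-2016-1.sql' for the actual schema.

        # A minimal sketch of joining enrolment and finance records on
        # ageid plus the year variables. Table names are assumptions.
        library(DBI)
        library(RSQLite)

        con <- dbConnect(RSQLite::SQLite(), "school-data-2016-1.db")

        enrolment_vs_finance <- dbGetQuery(con, "
          SELECT s.*, f.*
          FROM students AS s
          JOIN financesummary AS f
            ON  s.ageid = f.ageid
            AND s.studentsyear = f.financesummaryyear
        ")

        dbDisconnect(con)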

  16. Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2016

    • openicpsr.org
    • datasearch.gesis.org
    Updated May 18, 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2016 [Dataset]. http://doi.org/10.3886/E103500V3
    Explore at:
    Dataset updated
    May 18, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1992 - 2015
    Area covered
    United States
    Description

    Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes: Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about hate crimes reported in the United States. The data sets here combine all data from the years 1992-2016 into a single file. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident reported by an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. The data also include the total number of victims, the total number of offenders, and the race of the offenders (as a group). Finally, there are a number of columns indicating whether the victim of each offense was a certain type of victim (e.g. individual victim, business victim, religious victim, etc.).

    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), changed the names of some UCR offense codes (e.g. from "agg asslt" to "aggravated assault"), made all character values lower case, and reordered columns. I also added state, county, and place FIPS codes from the LEAIC (crosswalk) and generated incident month, weekday, and month-day variables from the incident date variable included in the original data. The zip file contains the data in the following formats, plus a codebook: .csv (Microsoft Excel), .dta (Stata), .sav (SPSS), and .rda (R). If you have any questions, comments, or suggestions, please contact me at jkkaplan6@gmail.com.
