https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A diverse selection of 1000 empirical time series, along with the results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.

The results of the computation are in the hctsa file HCTSA_Empirical1000.mat, for use in Matlab with v1.06 of hctsa. The same data are also provided in .csv format: hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and the corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv); the data of the individual time series (one time series per line, as described in hctsa_timeseries-info.csv) are in hctsa_timeseries-data.csv. These .csv files were produced by running >> OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.

The input file, INP_Empirical1000.mat, is for use with hctsa and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as >> TS_Init('INP_Empirical1000.mat');

Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be produced using TS_PlotTimeSeries from the hctsa package.

See the links in the references for more comprehensive documentation on performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
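For users working outside Matlab, the .csv exports can be loaded with standard tools. The sketch below uses pandas with a toy in-memory stand-in for the real files (the "Name" column header is an assumption; the real headers may differ) to show the intended row alignment between hctsa_datamatrix.csv and hctsa_timeseries-info.csv:

```python
import io

import pandas as pd

# Toy stand-ins for hctsa_datamatrix.csv (rows = time series, columns = features)
# and hctsa_timeseries-info.csv (one metadata row per time series); the real
# files are far larger, and the "Name" header is an assumption.
datamatrix = pd.read_csv(io.StringIO("0.12,0.93\n0.44,0.21\n"), header=None)
ts_info = pd.read_csv(io.StringIO("Name\nts_a\nts_b\n"))

# Align the feature-matrix rows with the time-series metadata
datamatrix.index = ts_info["Name"]
```

Each row of the matrix can then be looked up by time-series name rather than by position.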
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘E-Designations: CSV file’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/13ba2f51-b7cd-48fc-86d4-273a0ae3502c on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This data set contains information and locations of (E) Designations, including CEQR Environment Requirements (Table 1) and CEQR Restrictive Declarations (Table 2), in Appendix C of the Zoning Resolution. An (E) Designation provides notice of the presence of an environmental requirement pertaining to potential hazardous materials contamination, high ambient noise levels or air emission concerns on a particular tax lot.
All previously released versions of this data are available at BYTES of the BIG APPLE- Archive
--- Original source retains full ownership of the source dataset ---
description: This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open-source GridLAB-D modelling language [3]. In this dataset, each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder's geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and the corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed. For questions about this dataset, contact andy.hoke@nrel.gov. If you find this dataset useful, please mention NREL and cite [1] in your work.

References:
[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, "Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders," IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275
[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, "Modern Grid Initiative Distribution Taxonomy Final Report," PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf
[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, "Distribution power flow for smart grid technologies," IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
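The per-load file layout described above (one headerless two-column file per load bus, with an absolute first timestamp followed by "+1h" relative steps) can be sketched as follows. This is an illustrative script, not NREL's, using a made-up two-bus feeder table:

```python
import io

import pandas as pd

# Made-up miniature feeder CSV: the header row names the load buses,
# subsequent rows are hourly load values (the real files have 8760 rows).
feeder = pd.read_csv(io.StringIO("rl_1,cl_2A\n100.0,50.0\n110.0,55.0\n"))

def per_load_rows(column, start="2009-01-01 0:00:00"):
    """Two-column rows for one load bus: datetime first, load value second.

    Only the first time is absolute; the rest use the relative "+1h" format
    described for the sample file.
    """
    times = [start] + ["+1h"] * (len(feeder) - 1)
    return list(zip(times, feeder[column]))

rows = per_load_rows("rl_1")
# each `rows` list would then be written to its own headerless CSV file
```

Converting the real load values to P+Qj complex-power form would be an additional step on top of this sketch.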
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large data set of go-arounds (also referred to as missed approaches). The data set is in support of the paper presented at the OpenSky Symposium on 10 November.
If you use this data for a scientific publication, please consider citing our paper.
The data set contains landings at 176 (mostly large) airports in 44 different countries. Each landing is labelled as performing a go-around (GA) or not. In total, the data set contains almost 9 million landings with more than 33,000 GAs. The data was collected from the OpenSky Network's historical database for the year 2019. The published data set contains multiple files:
go_arounds_minimal.csv.gz
Compressed CSV containing the minimal data set. It contains a row for each landing with a minimal amount of information about the landing, including whether it was a GA. The data is structured in the following way:
time (date time): UTC time of landing or first GA attempt
icao24 (string): Unique 24-bit ICAO identifier of the aircraft concerned (hexadecimal)
callsign (string): Aircraft identifier in air-ground communications
airport (string): ICAO airport code where the aircraft is landing
runway (string): Runway designator on which the aircraft landed
has_ga (string): "True" if at least one GA was performed, otherwise "False"
n_approaches (integer): Number of approaches identified for this flight
n_rwy_approached (integer): Number of unique runways approached by this flight
The last two columns, n_approaches and n_rwy_approached, are useful for filtering out training and calibration flights. These usually have a large number of approaches, so an easy way to exclude them is to filter out flights with n_approaches > 2.
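In pandas, that filter might look like the following sketch (toy rows; the real file has the columns listed above):

```python
import pandas as pd

# Toy rows mimicking go_arounds_minimal.csv.gz
df = pd.DataFrame({
    "callsign": ["BAW123", "CALIB9", "DLH456"],
    "n_approaches": [1, 5, 2],
})

# Training/calibration flights typically approach many times;
# keeping n_approaches <= 2 filters them out.
regular = df[df["n_approaches"] <= 2]
```

Here the hypothetical "CALIB9" flight, with five approaches, is excluded.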
go_arounds_augmented.csv.gz
Compressed CSV containing the augmented data set. It contains a row for each landing with additional information about the landing, including whether it was a GA. The data is structured in the following way:
time (date time): UTC time of landing or first GA attempt
icao24 (string): Unique 24-bit ICAO identifier of the aircraft concerned (hexadecimal)
callsign (string): Aircraft identifier in air-ground communications
airport (string): ICAO airport code where the aircraft is landing
runway (string): Runway designator on which the aircraft landed
has_ga (string): "True" if at least one GA was performed, otherwise "False"
n_approaches (integer): Number of approaches identified for this flight
n_rwy_approached (integer): Number of unique runways approached by this flight
registration (string): Aircraft registration
typecode (string): Aircraft ICAO typecode
icaoaircrafttype (string): ICAO aircraft type
wtc (string): ICAO wake turbulence category
glide_slope_angle (float): Angle of the ILS glide slope in degrees
has_intersection (string): Boolean that is true if the runway has another runway intersecting it, otherwise false
rwy_length (float): Length of the runway in kilometres
airport_country (string): ISO Alpha-3 country code of the airport
airport_region (string): Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania)
operator_country (string): ISO Alpha-3 country code of the operator
operator_region (string): Geographical region of the operator of the aircraft (either Europe, North America, South America, Asia, Africa, or Oceania)
wind_speed_knts (integer): METAR, surface wind speed in knots
wind_dir_deg (integer): METAR, surface wind direction in degrees
wind_gust_knts (integer): METAR, surface wind gust speed in knots
visibility_m (float): METAR, visibility in metres
temperature_deg (integer): METAR, temperature in degrees Celsius
press_sea_level_p (float): METAR, sea level pressure in hPa
press_p (float): METAR, QNH in hPa
weather_intensity (list): METAR, list of present weather codes: qualifier - intensity
weather_precipitation (list): METAR, list of present weather codes: weather phenomena - precipitation
weather_desc (list): METAR, list of present weather codes: qualifier - descriptor
weather_obscuration (list): METAR, list of present weather codes: weather phenomena - obscuration
weather_other (list): METAR, list of present weather codes: weather phenomena - other
This data set is augmented with data from various public data sources. Aircraft-related data comes mostly from the OpenSky Network's aircraft database, the METAR information is from Iowa State University, and the rest is mostly scraped from different websites. If you need help with the METAR information, you can consult the WMO's Aerodrome Reports and Forecasts handbook.
go_arounds_agg.csv.gz
Compressed CSV containing the aggregated data set. It contains a row for each airport-runway, i.e. every runway at every airport for which data is available. The data is structured in the following way:
airport (string): ICAO airport code where the aircraft is landing
runway (string): Runway designator on which the aircraft landed
n_landings (integer): Total number of landings observed on this runway in 2019
ga_rate (float): Go-around rate, per 1000 landings
glide_slope_angle (float): Angle of the ILS glide slope in degrees
has_intersection (string): Boolean that is true if the runway has another runway intersecting it, otherwise false
rwy_length (float): Length of the runway in kilometres
airport_country (string): ISO Alpha-3 country code of the airport
airport_region (string): Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania)
This aggregated data set is used in the paper for the generalized linear regression model.
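The per-runway aggregation can be reproduced from the minimal file; here is a sketch with toy rows (the column names follow the minimal data set described above):

```python
import pandas as pd

# Toy landings mimicking go_arounds_minimal.csv.gz; note that has_ga is
# stored as the strings "True"/"False" in the published files.
df = pd.DataFrame({
    "airport": ["EGLC", "EGLC", "EGLC", "EGLC", "EHAM", "EHAM"],
    "runway":  ["27",   "27",   "27",   "27",   "18R",  "18R"],
    "has_ga":  ["True", "False", "False", "False", "False", "False"],
})

# Count landings and compute the go-around rate per 1000 landings
agg = (
    df.assign(is_ga=df["has_ga"].eq("True"))
      .groupby(["airport", "runway"])
      .agg(n_landings=("is_ga", "size"), ga_rate=("is_ga", "mean"))
)
agg["ga_rate"] *= 1000  # go-arounds per 1000 landings
```

With these toy rows, the hypothetical EGLC runway 27 has 4 landings and a GA rate of 250 per 1000.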
Downloading the trajectories
Users of this data set with access to the OpenSky Network's Impala shell can download the historical trajectories from the historical database with a few lines of Python code. Suppose, for example, you want to get all the go-arounds of 4 January 2019 at London City Airport (EGLC). You can use the Traffic library for easy access to the database:
import datetime

from tqdm.auto import tqdm
import pandas as pd
from traffic.data import opensky
from traffic.core import Traffic

df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)
df["time"] = pd.to_datetime(df["time"])

airport = "EGLC"
start = datetime.datetime(year=2019, month=1, day=4).replace(
    tzinfo=datetime.timezone.utc
)
stop = datetime.datetime(year=2019, month=1, day=5).replace(
    tzinfo=datetime.timezone.utc
)

df_selection = df.query("airport == @airport & has_ga & (@start <= time <= @stop)")

flights = []
delta_time = pd.Timedelta(minutes=10)
for _, row in tqdm(df_selection.iterrows(), total=df_selection.shape[0]):
    # take at most 10 minutes before and 10 minutes after the landing or go-around
    start_time = row["time"] - delta_time
    stop_time = row["time"] + delta_time

    # fetch the data from OpenSky Network
    flights.append(
        opensky.history(
            start=start_time.strftime("%Y-%m-%d %H:%M:%S"),
            stop=stop_time.strftime("%Y-%m-%d %H:%M:%S"),
            callsign=row["callsign"],
            return_flight=True,
        )
    )

Traffic.from_flights(flights)
Additional files
Additional files are available to check the quality of the classification into GA/not GA and the selection of the landing runway. These are:
validation_table.xlsx: This Excel sheet was completed manually during the review of the samples for each runway in the data set. It provides an estimate of the false positive and false negative rates of the go-around classification, as well as an estimate of the runway misclassification rate when the airport has two or more parallel runways. The columns with headers highlighted in red were filled in manually; the rest is generated automatically.
validation_sample.zip: For each runway, 8 batches of 500 randomly selected trajectories (or as many as available, if fewer than 4000) classified as not having a GA and up to 8 batches of 10 random landings, classified as GA, are plotted. This allows the interested user to visually inspect a random sample of the landings and go-arounds easily.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability; it will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
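The relation between the long and wide files can be illustrated with a toy sketch (the column names here are illustrative stand-ins, not the codebook's): keep one row per unique film ID, with the festival set to the first sample festival at which the film appeared.

```python
import pandas as pd

# Toy long-format rows (hypothetical column names): film 1 was sampled at
# two festivals, Berlinale first, matching the example in the description.
long_df = pd.DataFrame({
    "film_id": [1, 1, 2],
    "fest": ["Berlinale", "Frameline", "Sundance"],
})

# The wide file keeps one row per unique film, retaining the first
# sample festival where the film appeared.
wide_df = long_df.drop_duplicates("film_id", keep="first").reset_index(drop=True)
```

This assumes the long rows are already ordered by sampling date, as in the Berlinale/Frameline example above.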
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on the names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R-3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
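The two string metrics named here can be illustrated with a small self-contained sketch. The authors' pipeline is in R; this Python re-implementation shows the ideas, not their code: character-bigram cosine similarity, and the optimal string alignment (OSA) edit distance, which extends Levenshtein distance with adjacent-character transpositions.

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b, n=2):
    """Cosine similarity between character n-gram count vectors."""
    ga = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    gb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0

def osa_distance(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

A title with a single swapped letter pair (e.g. "abcd" vs "acbd") has OSA distance 1, which is why OSA is suited to catching typos and minor variations.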
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It reports the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset, listing the definitions of variables such as location, festival name, and festival categories.
https://crawlfeeds.com/privacy_policy
Unlock the power of online marketplace analytics with our comprehensive eBay products dataset. This premium collection contains 1.29 million products from eBay's global marketplace, providing extensive insights into one of the world's largest e-commerce platforms. Perfect for competitive analysis, pricing strategies, market research, and machine learning applications in e-commerce.
https://crawlfeeds.com/privacy_policy
This dataset provides comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.
Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.
With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey, which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to have all members of their unit respond, or to collate responses from their unit themselves before reporting in the survey.
Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," the 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values. Resources in this dataset: Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/ Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
This is the same data as in the Excel spreadsheet (also provided). Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
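The per-person storage calculation described above (high end of the reported range divided by the number of individuals covered, with actual values substituted for Big Data users) can be sketched as follows; the function name and example values are illustrative, not taken from the survey files:

```python
from typing import Optional

def per_person_storage_tb(range_high_tb: float, group_size: int = 1,
                          reported_tb: Optional[float] = None) -> float:
    """Per-person storage need in TB.

    Uses the high end of the reported survey range, divided by 1 for an
    individual response or by G (group_size) for a group response. For Big
    Data users, an actual reported or estimated value replaces the range.
    """
    if reported_tb is not None:  # Big Data user: actual or estimated value
        return reported_tb / group_size
    return range_high_tb / group_size

# An individual response in a hypothetical 1-10 TB range:
print(per_person_storage_tb(10.0))                 # 10.0
# A group response covering G = 5 individuals in the same range:
print(per_person_storage_tb(10.0, group_size=5))   # 2.0
```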
https://creativecommons.org/publicdomain/zero/1.0/
Attrition analysis: Identify factors correlated with attrition like department, role, salary, etc. Segment high-risk employees. Predict future attrition.
Performance management: Analyze the relationship between metrics like ratings and salary increments. Recommend performance improvement programs.
Workforce planning: Forecast staffing needs based on historical hiring/turnover trends. Determine optimal recruitment strategies.
Compensation analysis: Benchmark salaries against performance and experience. Identify pay inequities. Inform compensation policies.
Diversity monitoring: Assess diversity metrics like gender ratio across roles and departments. Identify underrepresented groups.
Succession planning: Identify high-potential candidates and critical roles. Predict internal promotions/replacements in advance.
Given its longitudinal employee data and multiple variables, this dataset provides rich opportunities for exploration, predictive modeling, and actionable insights. With a large sample size, it can uncover subtle patterns. Cleaning it and joining it with other contextual data sources can yield even deeper insights. This makes it a valuable starting point for many organizational studies and evidence-based decision-making.
.............................................................................................................................................................................................................................................
This dataset contains information about different attributes of employees from a company. It includes 1000 employee records and 12 feature columns.
satisfaction_level: Employee satisfaction score (1-5 scale)
last_evaluation: Score on last evaluation (1-5 scale)
number_project: Number of projects the employee worked on
average_monthly_hours: Average hours worked in a month
time_spend_company: Number of years spent with the company
work_accident: Whether the employee had a workplace accident (yes/no)
left: Whether the employee has left the company (yes/no)
promotion_last_5years: Number of promotions in the last 5 years
Department: Department of the employee
Salary: Annual salary of the employee
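As a minimal sketch of the attrition analysis suggested above, the per-department attrition rate can be computed from records shaped like this schema; the records below are made up for illustration, not drawn from the dataset:

```python
from collections import defaultdict

# Hypothetical records mirroring the schema (yes/no flags, department, salary).
employees = [
    {"department": "Sales", "left": "yes", "salary": 42000},
    {"department": "Sales", "left": "no",  "salary": 55000},
    {"department": "IT",    "left": "no",  "salary": 61000},
    {"department": "IT",    "left": "yes", "salary": 39000},
    {"department": "IT",    "left": "no",  "salary": 64000},
]

def attrition_by_department(rows):
    """Fraction of employees who left, per department."""
    totals, leavers = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["department"]] += 1
        leavers[r["department"]] += (r["left"] == "yes")
    return {d: leavers[d] / totals[d] for d in totals}

print(attrition_by_department(employees))
```

The same grouping generalizes directly to the other analyses listed above (e.g., mean salary or promotion counts per department) by swapping the aggregated field.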
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics of the data described in the Data Descriptor “The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms.” Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for a wide range of analytical applications.
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
It is a widely accepted fact that evolving software systems change and grow. However, it is less well understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer-term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as providing useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and that classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution-prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? 
(2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided in comma-separated values (CSV) format, and the first line of each CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We extend our previous work with the Yost Index by adding 90% confidence intervals to the index values. These were calculated using the variance replicate estimates published in association with the American Community Survey of the United States Census Bureau.
In the file yost-tract-2015-2019.csv, the data fields consist of an 11-digit geographic ID built from FIPS codes (2-digit state, 3-digit county, 6-digit census tract), the Yost index, the 90% lower confidence interval, and the 90% upper confidence interval. Data are provided for the 72,793 census tracts for which sufficient data were available. The Yost index ranges from 1 (lowest socioeconomic position) to 100 (highest socioeconomic position).
For those only interested in using the index as we have calculated it, the file yost-tract-2015-2019 is the only file you need. The other 368 files here are provided for anyone who wishes to replicate our results using the R program yost-conf-intervals.R. The program presumes the user is running a Windows machine and that all files reside in a folder called C:/yostindex. The R program requires a number of packages, all of which are specified in lines 10-22 of the program.
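The 11-digit GEOID field described above can be decomposed into its FIPS components with a few lines of code; this Python sketch (the published analysis code is the R program mentioned above) uses a made-up record for illustration:

```python
def split_geoid(geoid: str) -> dict:
    """Split an 11-digit census-tract GEOID into FIPS components:
    2-digit state, 3-digit county, 6-digit tract."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError("expected an 11-digit numeric GEOID")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}

# Hypothetical data row: GEOID, Yost index, 90% lower CI, 90% upper CI
row = "36001000100,57,49,64".split(",")
print(split_geoid(row[0]))   # {'state': '36', 'county': '001', 'tract': '000100'}
# Sanity check: the index value lies within its confidence interval.
assert int(row[2]) <= int(row[1]) <= int(row[3])
```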
Details of this project were published in Boscoe FP, Liu B, LaFantasie J, Niu L, Lee FF. Estimating uncertainty in a socioeconomic index derived from the American Community Survey. SSM-Population Health 2022; 18: 101078.
Additional years of data following this format are planned to be added to this repository in time.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and the language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and the language used to formulate the prompt.
This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.
The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:
prompt_translation: Contains files with the manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

id: The ID of the query in the CoderEval benchmark.
prompt: The original English prompt.
summary: The original summary.
code: The original code.
translation: The translation generated by GPT.
correction: The manual correction of the GPT-generated translation.
correction_tag: A list of tags indicating the corrections made to the translation.
generated_code: This column is initially empty and will contain the code generated from the translated prompt.

generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains a CSV file (e.g., java_chinese_claude.csv) with the generated code in the corresponding column.

tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

quantitative_analysis: Contains all the CSV reports of the static analysis tools and the test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in the paper's tables.
qualitative_analysis: Contains files used for the qualitative analysis, with the following columns:

id: The ID of the query in the CoderEval benchmark.
generated_code: The code generated by the model.
comments: The language used for comments.
identifiers: The language used for identifiers.
literals: The language used for literals.
notes: Additional notes.

ablation_study: Contains files for the ablation study. Each file has the following columns:

id: The ID of the query in the CoderEval benchmark.
prompt: The prompt used for code generation.
generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.

results.pdf shows the table containing all the percentages of comments, identifiers, and literals extracted from the CSV files of the ablation study.

Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:
You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
Use a Python code block to write your response.
Comments and identifiers must be in Italian.
For example:
```python
print("Hello World!")
```
The scripts directory contains all the scripts used to perform the generations and analyses. All files are properly commented. Here is a brief description of each file:
code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.
computeallanalysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.
createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.
deepseek_model.py: Defines a function that sends a request to the DeepSeek API, passing a system and a user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.
extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.
flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.
generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.
generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are
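The Flake8-to-CSV step performed by flake_analysis.py can be sketched roughly as follows; this is a minimal illustration of parsing Flake8's default `path:line:col: CODE message` output, and the real script's exact columns and grouping may differ:

```python
import csv
import io
import re

# Flake8's default output format: path:line:col: CODE message
LINE_RE = re.compile(r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<code>\S+) (?P<msg>.*)$")

def flake8_output_to_csv(flake8_output: str) -> str:
    """Convert raw Flake8 output into a CSV report, returned as a string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file", "line", "col", "code", "message"])
    for raw in flake8_output.splitlines():
        m = LINE_RE.match(raw)
        if m:  # skip any lines that are not standard Flake8 findings
            writer.writerow([m["file"], m["line"], m["col"], m["code"], m["msg"]])
    return buf.getvalue()

sample = "gen/iter1/foo.py:3:1: E302 expected 2 blank lines, found 1"
print(flake8_output_to_csv(sample))
```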
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support:
Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999.
An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127).
Note that the label “_2022_02_07” in many file names refers to the date of the primary analysis. “dts” or “dt” refer to “data.tables,” large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). “Rasters” refer to “.tif” files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster).
Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though they are aggregated across sites. These four categories are as follows (using “age_dts” for our site in Shaanxi Province, China as an example):
observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv);
potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label “_potential” (e.g., shaanxi_potential_age_2022_02_07.csv);
maximum age of abandonment over the course of the time series, with the label “_max” (e.g., shaanxi_max_age_2022_02_07.csv);
recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label “_recult” (e.g., shaanxi_recult_age_2022_02_07.csv).
This archive includes multiple .zip files, the contents of which are described below:
age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year, also referred to as length, duration, etc.), for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (in terms of longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (where years are labeled with "y" followed by the year, e.g., "y1987"). Maps are given with a latitude and longitude coordinate reference system. Folder contains observed age, potential age (“_potential”), maximum age (“_max”), and recultivation lengths (“_recult”) for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., “max age”, in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions "cc_filter_abn_dt()," “cc_calc_max_age()," “cc_calc_potential_age(),” and “cc_calc_recult_age();” see "_util/_util_functions.R."
age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for), for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Folder contains observed age, potential age (“_potential”), and maximum age (“_max”) rasters for all sites. Maximum age rasters include just one band (“layer”). These rasters match the corresponding .csv files contained in "age_dts.zip.”
derived_data.zip - summary datasets created throughout this analysis, listed below.
diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in "input_data.tables.zip" and "age_dts.zip." These files were produced using the custom function "cc_diff_dt()" (much like the base R function "diff()"), contained within the custom function "cc_filter_abn_dt()" (see "_util/_util_functions.R"). Folder contains diff files for observed abandonment, potential abandonment (“_potential”), and recultivation lengths (“_recult”) for all sites.
input_dts.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like “age_dts,” these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (in terms of longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., "y1987"). Note that these maps were recoded from Yin et al. 2020 so that land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a “clean” version produced by applying 5- and 8-year temporal filters to the raw input (see custom function “cc_temporal_filter_lc(),” in “_util/_util_functions.R” and “1_prep_r_to_dt.R”). These files correspond to those in "input_rasters.zip," and serve as the primary inputs for the analysis.
input_rasters.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as ".tif" files, where each band corresponds one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a “clean” version that has been processed with 5- and 8-year temporal filters (see above). These files match those in "input_dts.zip."
length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom function "cc_filter_abn_dt()" and “cc_extract_length();” see "_util/_util_functions.R."
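The year-to-year lagged differences stored in diff.zip behave like base R's diff() applied row-wise to the age tables. A rough Python sketch of that operation, assuming the wide y1987...y2017 column layout described above (the pixel values below are invented for illustration):

```python
def row_lagged_diffs(row: dict) -> dict:
    """Year-to-year differences in abandonment age for one pixel row.

    Expects year columns named 'y1987'...'y2017' (plus x/y coordinates,
    which are ignored); mirrors base R's diff() along the year axis.
    """
    # Keep only true year columns: 'y' followed by digits (skips the 'y' coordinate).
    years = sorted(k for k in row if k.startswith("y") and k[1:].isdigit())
    return {years[i]: row[years[i]] - row[years[i - 1]]
            for i in range(1, len(years))}

# Hypothetical pixel: abandoned in 1990, age accumulates, recultivated in 1993.
pixel = {"x": 108.1, "y": 33.9, "y1989": 0, "y1990": 1, "y1991": 2,
         "y1992": 3, "y1993": 0}
print(row_lagged_diffs(pixel))   # {'y1990': 1, 'y1991': 1, 'y1992': 1, 'y1993': -3}
```

A run of +1 differences marks continuing abandonment, and a large negative difference marks the year a pixel was recultivated, which is how such diff tables support the persistence analysis described below.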
derived_data.zip contains the following files:
"site_df.csv" - a simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites in how land cover classes were coded; see below).
Primary derived datasets for both observed abandonment (“area_dat”) and potential abandonment (“potential_area_dat”).
area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 year). Produced using custom functions "cc_calc_area_per_lc_abn()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R."
persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given "cohort" of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called "year_abn") in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, or the area of each cohort when it was first classified as abandoned at year 5 ("initial_area_abn"). The "age" is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and "time" is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using custom functions "cc_calc_persistence()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R." This serves as the main input for our linear models of recultivation (“decay”) trajectories.
turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using custom functions "cc_calc_abn_diff()" via "cc_summarize_abn_dts()" (see "_util/_util_functions.R"), implemented in "cluster/2_analyze_abn.R." This file is only produced for observed abandonment.
Area summary files (for observed abandonment only)
area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) "site," 2) "total_site_area_ha_2017" - the total site area (ha) in 2017, 3) "cropland_area_1987" - the area in cropland in 1987 (ha), 4) "area_abn_ha_2017" - the area of cropland abandoned as of 2017 (ha), 5) "area_ever_abn_ha" - the total area of those pixels that were abandoned at least once during the time series (corresponding to the area of potential abandonment, as of 2017), 6) "total_crop_extent_ha" - the total area of those pixels that were classified as cropland at least once during the time series, 7)
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a resource for building and assessing question-answering systems. It is available free of charge.
The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates are not included within this dataset description, which focuses solely on providing accurate and informative details about its content and purpose. Specific row or record counts are not detailed in the available information.
This dataset is ideal for a variety of applications and use cases:
* Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions.
* Machine Learning Model Creation: Develop machine learning models specifically for question answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels.
* Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file.
* Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts.
* Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training.
* Language Understanding: Train models to understand language and generate responses based on instructions and previous responses.
* Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students.
* Information Retrieval Systems: Create systems that help users find specific answers from large datasets.
* Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries.
* Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios.
* Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering.
* Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance.
* NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.
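A minimal sketch of working with files shaped like train.csv is shown below. The column names (instruction, response, is_human_response) are assumptions inferred from the description above; the actual headers should be checked against the files themselves:

```python
import pandas as pd
from io import StringIO

# Toy stand-in for train.csv with hypothetical column names; the real
# file also includes next responses per the description.
sample = StringIO(
    "instruction,response,is_human_response\n"
    "What is 2+2?,4,True\n"
    "Name a colour.,blue,False\n"
)
train = pd.read_csv(sample)  # "True"/"False" parse to a boolean column

# Keep only human-written answers, e.g. as evaluation references.
human = train[train["is_human_response"]]
print(len(human))
```

The same pattern applies to test.csv when comparing model predictions against the human-generated answers.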
The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.
CC0
This dataset is highly suitable for:
* Researchers and Practitioners: To gain insights into question answering tasks using AI models.
* Developers: To train models, create chatbots, and build conversational agents.
* Students: For developing educational materials and enhancing their learning experience through interactive tools.
* Individuals and teams working on Natural Language Processing (NLP) projects.
* Those creating information retrieval systems or customer support solutions.
* Experts in natural language generation (NLG) and automatic summarisation systems.
* Anyone involved in the evaluation of dialogue systems and machine learning model training.
Original Data Source: Question-Answering Training and Testing Data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of supply chains used by the company DataCo Global was used for the analysis. It is a supply chain dataset that allows the use of machine learning algorithms and R software. Areas of important registered activity: provisioning, production, sales, and commercial distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.
Data types: Structured data: DataCoSupplyChainDataset.csv; Unstructured data: tokenized_access_logs.csv (clickstream)
Types of products: clothing, sports, and electronic supplies
Additionally, another file, DescriptionDataCoSupplyChain.csv, is attached; it contains the description of each of the variables of DataCoSupplyChainDataset.csv.
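One way to use the description file is as a variable lookup alongside the main table. The sketch below uses a toy in-memory stand-in; the column headers and example rows are assumptions, so verify them against the actual DescriptionDataCoSupplyChain.csv:

```python
import pandas as pd
from io import StringIO

# Toy stand-in for DescriptionDataCoSupplyChain.csv: one row per
# variable of the main dataset. Headers here are hypothetical.
desc = pd.read_csv(StringIO(
    "FIELDS,DESCRIPTION\n"
    "Days for shipping (real),Actual shipping days of the purchased product\n"
    "Benefit per order,Earnings per order placed\n"
))

# Build a name -> description dictionary for quick reference while
# exploring DataCoSupplyChainDataset.csv.
lookup = dict(zip(desc["FIELDS"], desc["DESCRIPTION"]))
print(lookup["Benefit per order"])
```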
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JSON files from each site and depth (site_depth). CSV files exported from TagLab were merged into one large file for further analysis.
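The merge step described above can be sketched as follows (Python/pandas; the per-file naming scheme "&lt;site&gt;_&lt;depth&gt;.csv" and the demonstration data are assumptions, not the study's actual layout):

```python
import glob
import os
import tempfile
import pandas as pd

def merge_exports(folder):
    """Concatenate per-site_depth CSV exports into one table, recording
    the source file stem (assumed to encode site_depth) as a column."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(path)
        df["site_depth"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Demonstrate on two toy exports written to a temporary folder.
with tempfile.TemporaryDirectory() as d:
    pd.DataFrame({"label": ["coral"]}).to_csv(
        os.path.join(d, "siteA_5m.csv"), index=False)
    pd.DataFrame({"label": ["algae"]}).to_csv(
        os.path.join(d, "siteB_10m.csv"), index=False)
    merged = merge_exports(d)
    print(merged["site_depth"].tolist())
```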