This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and has also been used in undergraduate classrooms.
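A minimal sketch of the workflow these modules cover — importing a .csv time series, parsing dates, and plotting. The filename and column names here are hypothetical placeholders, not the actual NEON file layout:

```r
library(ggplot2)

# read a time series stored as .csv and parse the date column
temps <- read.csv("NEON_airtemp.csv")            # hypothetical filename
temps$date <- as.Date(temps$date, "%Y-%m-%d")    # assumed column name and format

# plot the series
ggplot(temps, aes(x = date, y = mean_temp)) +    # assumed column name
  geom_line() +
  labs(x = "Date", y = "Mean air temperature (C)")
```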
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR. The use of UMIs allowed accurate template quantitation, as well as removal of point mutations introduced during PCR and sequencing, to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
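For readers curious what the Iris warm-up looks like in practice, this is the kind of analysis the module walks through, using only the iris data built into R:

```r
# summary statistics and a correlation
summary(iris$Sepal.Length)
cor(iris$Sepal.Length, iris$Petal.Length)

# a histogram of sepal length
hist(iris$Sepal.Length,
     main = "Sepal length across all species", xlab = "Sepal length (cm)")

# a scatter plot of sepal vs. petal length, colored by species
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species, pch = 19,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
```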
This research study was conducted to analyze the potential relationship between hardware and data set sizes. One hundred data scientists from France were interviewed between Jan-2016 and Aug-2016 in order to obtain exploitable data. Therefore, this sample might not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
We chose the characteristics describing Data Scientists and data set sizes arbitrarily.
Data set size:
The data uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all research.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This large dataset (8+ million observations) contains job descriptions and ratings across various criteria such as work-life balance, income, culture, etc.
This data set complements the Glassdoor dataset located [here](https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews).
Please cite as: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FQstpaoAAAAJ&citation_for_view=FQstpaoAAAAJ:UebtZRa9Y70C
Glassdoor Reviews
Glassdoor produces reports based upon the data collected from its users, on topics including work–life balance, CEO pay ratios, lists of the best office places and cultures, and the accuracy of corporate job searching maxims. Data from Glassdoor has also been used by outside sources to produce estimates on the effects of salary trends and changes on corporate revenues. Glassdoor also puts the conclusions of its research of other companies towards its own company policies. In 2015, Tom Lakin produced the first study of Glassdoor in the United Kingdom, concluding that Glassdoor is regarded by users as a more trustworthy source of information than career guides or official company documents.
Features
The columns correspond to the date of the review, the job name, the job location, the status of the reviewers, and the reviews. Reviews are divided into the sub-categories Career Opportunities, Comp & Benefits, Culture & Values, Senior Management, and Work/Life Balance. In addition, employees can add recommendations on the firm, the CEO, and the outlook.
Other information
Rankings for the recommendation of the firm, CEO approval, and outlook are allocated the categories v, r, x, and o, with the following meanings: v - Positive, r - Mild, x - Negative, o - No opinion.
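A minimal R sketch of recoding those categories into readable labels. The filename and the recommend column name are assumptions for illustration, not the dataset's documented schema:

```r
library(dplyr)

reviews <- read.csv("glassdoor_reviews.csv")     # hypothetical filename

reviews <- reviews |>
  mutate(recommend_label = recode(recommend,     # assumed column name
                                  "v" = "Positive",
                                  "r" = "Mild",
                                  "x" = "Negative",
                                  "o" = "No opinion"))
```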
Some examples of the textual data entries
MCDONALD-S I don't like working here,don't work here Headline: I don't like working here,don't work here Pros: Some people are nice,some free food,some of the managers are nice about 95% of the time Cons: 95% of people are mean to employees/customers,its not a clean place,people barely clean their hands of what i see,managers are mean,i got a stress rash because of this i can't get rid of it,they don't give me a little raise even though i do alot of crap there for them Rating: 1.0
KPMG Quit working people to death Headline: Quit working people to death Pros: Lots of PTO, Good company training Cons: long hours, clear disconnect between management and staff, as corporate as it gets Rating: 2.0
PRIMARK Sales assistant Headline: Sales assistant Pros: Lovely staff, managers are also very nice Cons: Hardwork, often rude customers, underpaid for u18 Rating: 3.0
J-P-MORGAN Life in JPM, Bangalore Headline: Life in JPM, Bangalore Pros: Good place to start, lots of opportunity. Cons: Be ready to put in a lot of effort not a place to chill out. Rating: 4.0
VODAFONE Good to be here Headline: Good to be here Pros: Fast moving with technology. Leading Cons: There are areas you may want to avoid Rating: 5.0
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite puts everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
* download the fixed-width file containing household, family, and person records
* import by separating this file into three tables, then merge 'em together at the person-level
* download the fixed-width file containing the person-level replicate weights
* merge the rectangular person-level file with the replicate weights, then store it in a sql database
* create a new variable - one - in the data table

2012 asec - analysis examples.R
* connect to the sql database created by the 'download all microdata' program
* create the complex sample survey object, using the replicate weights
* perform a boatload of analysis examples

replicate census estimates - 2011.R
* connect to the sql database created by the 'download all microdata' program
* create the complex sample survey object, using the replicate weights
* match the sas output shown in the png file below

2011 asec replicate weight sas output.png
* statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
* the census bureau's current population survey page
* the bureau of labor statistics' current population survey page
* the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
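here's a minimal sketch of the general approach described above (not the repository's actual script) - the filenames are hypothetical placeholders for an asec fixed-width file and its matching nber sas importation script:

```r
library(SAScii)    # parse.SAScii() / read.SAScii() understand sas INPUT blocks
library(DBI)
library(RSQLite)

# parse the nber sas script to recover the fixed-width layout
layout <- parse.SAScii("cpsmar2012.sas")                      # hypothetical filename

# read the fixed-width microdata using that same sas script
asec <- read.SAScii("asec2012_pubuse.dat", "cpsmar2012.sas")  # hypothetical filenames

# stash everything in a sqlite database so analyses don't eat all your ram
db <- dbConnect(SQLite(), "cps.asec.db")
dbWriteTable(db, "asec12", asec)
dbDisconnect(db)
```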
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This large dataset contains job descriptions and ratings across various criteria such as work-life balance, income, culture, etc. The data covers various industries in the UK. It is a great dataset for multidimensional sentiment analysis.
This data set complements the Glassdoor dataset located [here](https://www.kaggle.com/datasets/davidgauthier/glassdoor-job-reviews-2).
Please cite as: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FQstpaoAAAAJ&citation_for_view=FQstpaoAAAAJ:UebtZRa9Y70C
Glassdoor produces reports based upon the data collected from its users, on topics including work–life balance, CEO pay ratios, lists of the best office places and cultures, and the accuracy of corporate job searching maxims. Data from Glassdoor has also been used by outside sources to produce estimates on the effects of salary trends and changes on corporate revenues. Glassdoor also puts the conclusions of its research of other companies towards its own company policies. In 2015, Tom Lakin produced the first study of Glassdoor in the United Kingdom, concluding that Glassdoor is regarded by users as a more trustworthy source of information than career guides or official company documents.
The columns correspond to the date of the review, the job name, the job location, the status of the reviewers, and the reviews. Reviews are divided into the sub-categories Career Opportunities, Comp & Benefits, Culture & Values, Senior Management, and Work/Life Balance. In addition, employees can add recommendations on the firm, the CEO, and the outlook.
Rankings for the recommendation of the firm, CEO approval, and outlook are allocated the categories v, r, x, and o, with the following meanings: v - Positive, r - Mild, x - Negative, o - No opinion.
MCDONALD-S I don't like working here,don't work here Headline: I don't like working here,don't work here Pros: Some people are nice,some free food,some of the managers are nice about 95% of the time Cons: 95% of people are mean to employees/customers,its not a clean place,people barely clean their hands of what i see,managers are mean,i got a stress rash because of this i can't get rid of it,they don't give me a little raise even though i do alot of crap there for them Rating: 1.0
KPMG Quit working people to death Headline: Quit working people to death Pros: Lots of PTO, Good company training Cons: long hours, clear disconnect between management and staff, as corporate as it gets Rating: 2.0
PRIMARK Sales assistant Headline: Sales assistant Pros: Lovely staff, managers are also very nice Cons: Hardwork, often rude customers, underpaid for u18 Rating: 3.0
J-P-MORGAN Life in JPM, Bangalore Headline: Life in JPM, Bangalore Pros: Good place to start, lots of opportunity. Cons: Be ready to put in a lot of efforts not a place to chill out. Rating: 4.0
VODAFONE Good to be here Headline: Good to be here Pros: Fast moving with technology. Leading Cons: There are areas you may want to avoid Rating: 5.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the author is R. Brad Long. It features 7 columns including author, publication date, language, and book publisher.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
With the increasing size and complexity of biogeographical and phylogenetic data, it is important to provide functions that allow the manipulation of these datasets efficiently and rapidly. The phylogrid package aims to facilitate the spatial analysis of measures of diversity and endemism. Our package provides a set of functions to join the results derived from species distribution models (SDMs) with phylogenies and endemism data. The functions are focused on pre-processing, processing, and post-processing of macroecological and phylogenetic data. The pre-processing step offers basic functions for preparing the data before running the analyses. The processing step brings together functions to calculate Faith's phylogenetic diversity, phylogenetic endemism, weighted endemism, and evolutionary distinctiveness. This step also provides functions to calculate the standardized effect size for each metric through different methods of spatial and phylogenetic randomization, aiming to control for richness effects. The post-processing stage includes functions to calculate the delta of metrics between different times (e.g. present and future). We have shown that the package has a slightly longer computation time than comparable packages, but takes up a considerably smaller portion of RAM, which will allow users to work with high-resolution datasets from local to global scales. This enhances the applicability of the package by enabling users to work with large datasets on computers with less RAM available.
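Since phylogrid's own function names are not given above, the core computation (Faith's PD and its richness-controlled standardized effect size) can be illustrated with the picante package, a different library, on toy data:

```r
library(ape)      # phylogeny handling
library(picante)  # pd() and ses.pd()

set.seed(1)
tree <- rcoal(10)                                  # random 10-species tree
comm <- matrix(rbinom(50, 1, 0.5), nrow = 5,       # 5 sites x 10 species presence/absence
               dimnames = list(paste0("site", 1:5), tree$tip.label))

pd(comm, tree, include.root = TRUE)                # Faith's PD and richness per site

# standardized effect size via taxa-label randomization, controlling for richness
ses.pd(comm, tree, null.model = "taxa.labels", runs = 99)
```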
eBird is among the world’s largest biodiversity-related science projects, with more than 100 million bird sightings contributed annually by eBirders around the world and an average participation growth rate of approximately 20% year over year. eBird is managed by the Cornell Lab of Ornithology. Some data has been contributed under INTAROS WP4. eBird provides open data access in several formats to logged-in users, ranging from raw data to processed datasets geared toward more rigorous scientific modeling.

eBird Basic Dataset (EBD)
The EBD is the core dataset for accessing all raw eBird observations and associated metadata. The EBD is updated monthly (on the 15th of each month) and is available by direct download through eBird to any logged-in user after completion of a data request form. The data request form allows us to gain some understanding of how the data will be used. Requests are typically approved within 7 days. Data are provided with documentation in spreadsheet format, which can be read by a variety of programs. Although Excel or similar programs work for basic analyses, for larger datasets (>1 million rows) or more sophisticated analyses, we recommend using programs like R. There are several R packages available for summarizing data, including one that is managed here at the Cornell Lab specifically for working with the EBD dataset. The data collection may enable a better understanding of bird population dynamics and the status of bird species, including bird conservation management requirements.
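Assuming the Cornell Lab package referred to above is auk (the R package for working with EBD downloads), a minimal sketch of filtering an extract looks like this; the filenames are hypothetical placeholders:

```r
library(auk)

ebd <- auk_ebd("ebd_sample.txt")            # hypothetical EBD download
filters <- ebd |>
  auk_species("Snowy Owl") |>               # restrict to one species
  auk_country("Norway")                     # and one country
auk_filter(filters, file = "ebd_filtered.txt")

obs <- read_ebd("ebd_filtered.txt")         # tidy data frame of observations
```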
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
R-HORIZON
How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
📃 Paper • 🌐 Project Page • 🤗 Dataset
R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span… See the full description on the dataset page: https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.
IPGOD is large, with millions of data points across up to 40 tables, making it too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
IP Australia is also providing free trials of a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as IPGOD, through the web browser, without any installation of software: the IP Data Platform.
The following pages can help you gain an understanding of the intellectual property administration and processes in Australia to help your analysis of the dataset:
* Patents
* Trade Marks
* Designs
* Plant Breeder’s Rights
Due to the changes in our systems, some tables have been affected:
* We have added IPGOD 225 and IPGOD 325 to the dataset!
* The IPGOD 206 table is not available this year.
* Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.
Data quality has been improved across all tables:
* Null values are simply empty rather than '31/12/9999'.
* All date columns are now in ISO format 'yyyy-mm-dd'.
* All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
* All tables are encoded in UTF-8.
* All tables use the backslash \ as the escape character.
* The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
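Given those format notes, a minimal sketch for reading an IPGOD table in R might look like the following; the filename is a hypothetical placeholder:

```r
library(readr)

ipgod_table <- read_delim(
  "ipgod_table.csv",                   # hypothetical table filename
  delim = ",",
  escape_backslash = TRUE,             # tables use \ as the escape character
  escape_double = FALSE,
  locale = locale(encoding = "UTF-8")  # tables are encoded in UTF-8
)
```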
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Hi everyone! Iʼm currently in the 8th course of the Data Analysis program on Coursera, and this is my first case study completed entirely on my own (following the steps provided throughout the course). I chose to work with R because it was one of the tools that interested me the most during my learning journey. In my opinion, itʼs incredibly powerful — you can handle large datasets and create visualizations with just a few commands. Iʼm really happy with the final result and I feel I learned a lot throughout the process. Iʼve truly fallen in love with data and its power to create change. Thanks for reading, and Iʼd be happy to receive any feedback!
https://creativecommons.org/publicdomain/zero/1.0/
By Reddit [source]
This dataset, labeled as Reddit Technology Data, provides thorough insights into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains the titles of discussions, scores contributed by Reddit users, the unique IDs attributed to different discussions, the URLs of those discussions (if any), comment counts for each discussion thread, and timestamps of when the conversations were initiated. As such, this data is valuable for tech-savvy people who want to stay up to date with new developments in their field, and for professionals looking to keep abreast of industry trends. In short, it is a repository that helps people make sense of what is happening in the technology world at large, whether that inspires action on their part or simply educates them about forthcoming changes.
The dataset includes six columns: the title, the score, the URL linking to the discussion page on Reddit, the comment count, the created timestamp (when the post was made), and the body containing the actual text of the post. Analyzing each column separately reveals what kind of information users look for in technology-related discussions. One can also form hypotheses about correlations between factors such as rating or comment count: what do people comment on or react to most, and does a high rating always come with very long comment threads? Approached this way, a forum like Reddit becomes a rich source of information about users' interests in technology topics, and trends (for example, reactions to voice search technology) can be visualized to gauge users' overall response to information shared through public forums. These insights can serve research purposes and, if monitored periodically, may reveal potential business opportunities.
- Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
- Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
- Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: technology.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion. (String) |
| score | The score of the discussion as measured by Reddit contributors. (Integer) |
| url | The website URL associated with the discussion. (String) |
| comms_num | The number of comments associated with the discussion. (Integer) |
| created | The date and time the discussion was created. (DateTime) |
| body | The body content of the discussion. (String) |
| timestamp | The timestamp of the discussion. (Integer) |
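A minimal sketch of loading the file and probing one of the questions raised above (does a high score go with a long comment thread?):

```r
tech <- read.csv("technology.csv")

# correlation between post score and number of comments
cor(tech$score, tech$comms_num, use = "complete.obs")

# the most-commented discussions
head(tech[order(-tech$comms_num), c("title", "score", "comms_num")])
```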
If you use this dataset in your research, please credit Reddit.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review at NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
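A hedged sketch of the long-to-wide reduction described above; the ID column name is an assumption, and the collapse rule (keep the first sample festival) is simplified for illustration:

```r
library(dplyr)

films_long <- read.csv("1_film-dataset_festival-program_long.csv")

# one row per unique film, keeping the first sample festival it appeared at
films_wide <- films_long |>
  group_by(film_id) |>     # assumed ID column name
  slice(1) |>
  ungroup()
```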
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R-3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
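The two string-distance methods named above exist in the stringdist R package, so the matching step can be illustrated as follows; the titles and the similarity cutoff are made-up examples, not the study's actual values:

```r
library(stringdist)

core_title <- "The Grand Budapest Hotel"
imdb_title <- "Grand Budapest Hotel, The"

stringsim(core_title, imdb_title, method = "cosine", q = 2)  # similarity in [0, 1]
stringsim(core_title, imdb_title, method = "osa")            # tolerant of typos/transpositions

# keep candidate pairs whose similarity clears an (illustrative) cutoff
is_match <- stringsim(core_title, imdb_title, method = "osa") > 0.8
```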
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours; therefore, a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name, and festival categories.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions – i.e., different numbers of clusters and separation values – generated using the clusterGeneration package in R. Each set of simulated datasets consists of 91 datasets in comma-separated values (csv) format (182 csv files in total), with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range between (−0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value of the clusters in the dataset are encoded in the file name, size_X_n_Y_sepval_Z.csv, where:
X = size of the dataset
Y = number of clusters in the dataset
Z = separation value of the clusters in the dataset
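A minimal sketch for recovering those parameters from a file name in R; the example file name is hypothetical:

```r
fname <- "size_5000_n_12_sepval_0.4.csv"   # hypothetical example

parts  <- regmatches(fname,
                     regexec("size_(\\d+)_n_(\\d+)_sepval_([0-9.]+)\\.csv", fname))[[1]]
size   <- as.integer(parts[2])   # X: size of the dataset
k      <- as.integer(parts[3])   # Y: number of clusters
sepval <- as.numeric(parts[4])   # Z: separation value

dat <- read.csv(fname)
```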
Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem — (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
Dataset Card for FIB
Dataset Summary
The FIB benchmark consists of 3579 examples for evaluating the factual inconsistency of large language models. Each example consists of a document and a pair of summaries: a factually consistent one and a factually inconsistent one. It is based on documents and summaries from XSum and CNN/DM.
Since this dataset is intended to evaluate the factual inconsistency of large language models, there is only a test split.
Accuracies should be… See the full description on the dataset page: https://huggingface.co/datasets/r-three/fib.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Available variables in KNMI-LENTIS
request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt
Where is the data deposited on the ECMWF's tape storage (section 4)
LENTIS_on_ECFS.zip
Data of all variables for 1 year for 1 ensemble member (section 5)
tree_of_files_one_member_all_data.txt
{AERmon,Amon,Emon,LImon,Lmon,Ofx,Omon,SImon,fx,Eday,Oday,day,CFday,3hr,6hrPlev,6hrPlevPt}.zip
This Zenodo dataset pertains to the full KNMI-LENTIS dataset: a large ensemble of simulations with the Global Climate Model EC-Earth3. The simulations cover a present-day period (2000-2009) and a future +2K period (2075-2084, following SSP2-4.5). KNMI-LENTIS has 1600 simulated years for each of the two climates. This level of sampled climate variability allows for robust and in-depth research into extreme events. The available variables are listed in the file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt. All variables are cmorised following the CMIP6 data format convention. Further details on the variables and their output dimensions are available via the following search tool. The total size of KNMI-LENTIS is 128 TB. KNMI-LENTIS is stored at the high performance storage system of the ECMWF (ECFS).
The Global Climate Model that is used for generating this Large Ensemble is EC-Earth3 - VAREX project branch https://svn.ec-earth.org/ecearth3/branches/projects/varex (access restricted to ECMWF members).
The goals of this Zenodo dataset are:
to provide an accurate description and example of how the KNMI-LENTIS dataset is organised.
to describe on which servers the data are deposited and how future users can gain access to the data
to provide links to related git repositories and other content relating to the KNMI-LENTIS production
KNMI-LENTIS consists of 2 times 160 runs of 10 years. All simulations have a unique ensemble member label that reflects the forcing, and how the initial conditions are generated. The initial conditions have two aspects: the parent simulation from which the run is branched (macro perturbation, there are 16), and the seed relating to a particular micro-perturbation in the initial three-dimensional atmosphere temperature field (there are 10). The ensemble member label thus is a combination of:
forcing (h for present-day/historical and s for +2K/SSP2-4.5)
parent ID (number between 1 and 16)
micro perturbation ID (number between 0 and 9)
In this Zenodo dataset we publish 1 year from 1 member to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The published data is year 2000 from member h010. See Section 4.
Further, all KNMI-LENTIS simulations are labeled per the CMIP6 convention of variant labelling. A variant label is made from four components: the realization index r, the initialization index i, the physics index p and the forcing index f. Further details on CMIP6 variant labelling can be found in The CMIP6 Participation Guidance for Modelers. In the KNMI-LENTIS data set, the forcing is reflected in the first digit of the realization index r of the variant label. For the historical simulations, the one thousands (r1000-r1999) have been reserved. For the SSP2-4.5 simulations, the five thousands (r5000-r5999) have been reserved. The parent is reflected in the second and third digits of the realization index r of the variant label (r?01?-r?16?). The seed is reflected in the fourth digit of the realization index r (r???0-r???9). The seed is also reflected in the initialization index i of the variant label (i0-i9), so this information is duplicated. The physics index p5 has been reserved for the ECE3p5 version: all KNMI-LENTIS simulations have the p5 label. The forcing index f of the variant label is kept at 1 for all KNMI-LENTIS simulations. As an example, the variant label r5119i9p5f1 refers to: the +2K time slice with parent 11 and randomizing seed number 9. The physics index is 5, meaning the run is done with the ECE3p5 version of EC-Earth3.
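A small helper (not part of the dataset itself) that decodes a variant label according to the scheme just described:

```r
decode_variant <- function(label) {
  m <- regmatches(label, regexec("r([15])(\\d{2})(\\d)i(\\d)p(\\d)f(\\d)", label))[[1]]
  list(
    forcing = if (m[2] == "1") "present-day/historical" else "+2K/SSP2-4.5",
    parent  = as.integer(m[3]),   # parent simulation, 01-16
    seed    = as.integer(m[4]),   # micro-perturbation seed, 0-9
    physics = as.integer(m[6])    # p5 = ECE3p5 model version
  )
}

decode_variant("r5119i9p5f1")
# forcing "+2K/SSP2-4.5", parent 11, seed 9, physics 5
```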
In this Zenodo folder, there are several text files and several netcdf files. The text files provide overviews of the available variables and of how the files are organised on the ECFS tape storage.
Data from KNMI-LENTIS is deposited in the ECMWF ECFS tape storage system. Data can be freely downloaded by those who have access to the ECMWF ECFS. Otherwise, the data can be made available by the authors upon request.
The way the dataset is organised is detailed in LENTIS_on_ECFS.zip. This contains details on all available KNMI-LENTIS files, in particular how these are filed in ECFS. The files on ECFS are tar-zipped per ensemble member and variable: each archive contains 10 years of ensemble member data (10 separate netcdf files). The location on ECFS of the tar-zipped files that are listed in the various text files in this Zenodo dataset is
ec:/nklm/LENTIS/ec-earth/cmorised_by_var/
for freq in AERmon Amon Emon LImon Lmon Ofx Omon SImon fx Eday Oday day CFday 3hr 6hrPlev 6hrPlevPt; do
  for scen in hxxx sxxx; do
    els -l ec:/nklm/LENTIS/ec-earth/cmorised_by_var/${scen}/${freq}/* >> LENTIS_on_ECFS_${scen}_${freq}.txt
  done
done
Further, part of the data will be made publicly available from the Earth System Grid Federation (ESGF) data portal. We aim to upload most of the monthly variables for the full ensemble. As search terms, use EC-Earth for the model and p5 for the physics index to locate the KNMI-LENTIS data.
The netcdf files of the data of 1 year from member h010 are published here to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The data are in zipped folders per output frequency: AERmon, Amon, Emon, LImon, Lmon, Ofx, Omon, SImon, fx, Eday, Oday, day, CFday, 3hr, 6hrPlev, 6hrPlevPt. The text file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt gives an overview of the variables available per output frequency. The text file tree_of_files_one_member_all_data.txt gives an overview of the files in the zipped folders.
The production of the KNMI-LENTIS ensemble was funded by the KNMI (Royal Dutch Meteorological Institute) multi-year strategic research fund KNMI MSO Climate Variability And Extremes (VAREX).
GitHub repository corresponding to this Zenodo dataset: https://github.com/lmuntjewerf/KNMI-LENTIS_dataset_description.git
Github repository for KNMI-LENTIS production code: https://github.com/lmuntjewerf/KNMI-LENTIS_production_script_train.git
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
The dataset contains two input files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

1-script-create-input-data-raw.r preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); the result is available in input_data_raw.txt.

2-script-create-motion-chart-input-data.R processes input_data_raw.txt, normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

Open the Future Constructions.Rproj file to start an RStudio session whose working directory is associated with the contents of this repository.
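A minimal sketch of the per-million-words normalisation performed by the second script, assuming both text files are tab-separated and that coha_size.txt maps each decade to its corpus size (the column names are assumptions based on the description above):

```r
library(readr)
library(dplyr)

freqs <- read_tsv("input_data_raw.txt")   # decade, coll, `BE going to`, will
sizes <- read_tsv("coha_size.txt")        # assumed columns: decade, corpus_size

freqs_pmw <- freqs |>
  left_join(sizes, by = "decade") |>
  mutate(going_to_pmw = `BE going to` / corpus_size * 1e6,  # per million words
         will_pmw     = will          / corpus_size * 1e6)
```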