WikiSQL consists of a corpus of 87,726 hand-annotated pairs of natural language questions and SQL queries. These pairs are split into training (61,297 examples), development (9,145 examples) and test (17,284 examples) sets. It can be used for developing natural language interfaces for relational databases.
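The corpus can be loaded with the Hugging Face datasets library; a minimal sketch, assuming the copy published on the hub under the "wikisql" identifier:

```python
# Minimal sketch: load WikiSQL via Hugging Face datasets (pip install datasets).
# Assumes the hub copy published under the "wikisql" identifier.
from datasets import load_dataset

wikisql = load_dataset("wikisql")  # splits: train / validation / test

example = wikisql["train"][0]
print(example["question"])               # natural language question
print(example["sql"]["human_readable"])  # the paired SQL query
```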
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Publicly accessible databases often impose query limits or require registration. Even though I maintain public, limit-free APIs, I never wanted to host a public database because I tend to think that connection strings are a problem for the user.
I've decided to host several light- and medium-sized datasets using PostgreSQL, MySQL and SQL Server backends (in strict descending order of preference!).
Why 3 database backends? I think there are a ton of small edge cases when moving between DB backends, so testing against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.
Please send me a tweet if you need the connection strings for your lectures or workshops. My Twitter username is @pachamaltese. See the SQL dumps in each section if you want the data locally.
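Once you have a connection string, a minimal sketch of querying one of the backends with SQLAlchemy and pandas (the credentials, host and table name below are placeholders, not real values):

```python
# Sketch: query a hosted PostgreSQL backend with SQLAlchemy + pandas.
# pip install sqlalchemy psycopg2-binary pandas
# All connection details below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")
df = pd.read_sql("SELECT * FROM some_table LIMIT 10;", engine)
print(df.head())
```

The same code works against the MySQL or SQL Server backends by swapping the dialect prefix of the connection string (e.g. mysql+pymysql:// or mssql+pyodbc://) and installing the matching driver.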
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Available functions in rEHR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
SQL program. A program written in SQL that performs the six queries on the MySQL database. (SQL, 15.3 kb)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The dataset includes YouTube trending-video statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and covers 19 countries:

- IT - Italy
- ES - Spain
- GR - Greece
- HR - Croatia
- TR - Turkey
- AL - Albania
- DZ - Algeria
- EG - Egypt
- LY - Libya
- TN - Tunisia
- MA - Morocco
- IL - Israel
- ME - Montenegro
- LB - Lebanon
- FR - France
- BA - Bosnia and Herzegovina
- MT - Malta
- SI - Slovenia
- CY - Cyprus

The columns are the following:

- country: the country in which the video was published.
- video_id: video identification number. Each video has one; you can find it by right-clicking a video and selecting 'stats for nerds'.
- title: title of the video.
- publishedAt: publication date of the video.
- channelId: identification number of the channel that published the video.
- channelTitle: name of the channel that published the video.
- categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'music' category. Check here for the complete list.
- trending_date: trending date of the video.
- tags: tags present in the video.
- view_count: view count of the video.
- comment_count: number of comments on the video.
- thumbnail_link: link of the image shown before the video is clicked.
- comments_disabled: whether comments are disabled for the video.
- ratings_disabled: whether ratings are disabled for the video.
- description: description below the video.

Inspiration: You can perform an exploratory data analysis of the dataset with Pandas or Numpy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the Pandas functions. It's also possible to analyze the titles, tags, and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :)
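As a starting point, a quick exploratory sketch with pandas (the CSV filename below is a hypothetical placeholder for whichever file you downloaded from the dataset):

```python
# Sketch: basic EDA on the trending-videos table with pandas.
# "trending_videos.csv" is a hypothetical filename; adjust to your download.
import pandas as pd

df = pd.read_csv("trending_videos.csv")

# Ten most-viewed trending videos
top = df.sort_values("view_count", ascending=False).head(10)
print(top[["country", "title", "channelTitle", "view_count"]])

# How many trending videos each category has
print(df["categoryId"].value_counts())
```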
CC0
Original Data Source: YouTube Trending Videos of the Day
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Definitions of incidence and prevalence terms.
If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy, it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.
Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.
With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.
Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.
Technical Preparation
Qualifying for a job as a data scientist requires thorough technical preparation. Job seekers are often required to demonstrate their technical skills in order to show their ability to effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:
Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.
Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.
Make sure you're good with data tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.
Gain proficiency in the use of SQL language to extract and process data from databases.
Understand and know the importance of feature engineering and how to create meaningful features from raw data.
Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score (see the sketch after this list).
If the job requires it, become familiar with big data technologies like Hadoop and Spark.
Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.
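To make the metrics tip concrete, a minimal sketch with scikit-learn on a toy set of true and predicted binary labels (the numbers are made up for illustration):

```python
# Sketch: the four evaluation metrics named above, via scikit-learn.
# pip install scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # toy model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```

Being able to explain in an interview why precision and recall can diverge on imbalanced data is worth as much as computing them.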
Portfolio and Projects
Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.
Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.
Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.
Maintain a well-organized GitHub profile with clean code and clear project documentation.
Domain Knowledge
Research the industry you’re applying to and understand its specific data challenges and opportunities.
Study the company you’re interviewing with to tailor your responses and show your genuine interest.
Soft Skills
Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.
Focus on your problem-solving abilities and how you approach complex challenges.
Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.
Interview Etiquette
Dress and present yourself in a professional manner, whether the interview is in person or remote.
Be on time for the interview, whether it’s virtual or in person.
Maintain good posture and eye contact during the interview. Smile and exhibit confidence.
Pay close attention to the interviewer's questions and answer them directly.
Behavioral Questions
Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.
Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.
Highlight instances where you’ve worked effectively in cross-functional teams...
MIT License: https://opensource.org/licenses/MIT
CODE
--------
The R markdown script 'cogcarsim_analyses.Rmd' will recompute the analyses from Palomäki et al. 2021, "The Link Between Flow and Performance is Moderated by Task Experience". Precompiled HTML output of this script is also provided. To run the script, download all contents of this Figshare object, load cogcarsim_analyses.Rmd in RStudio and knit (press Ctrl+Shift+K on Linux). Note also that to export figures, you must uncomment the corresponding lines of code (e.g. line 116: #ggsave("figure4.pdf", width=12, height=6)).

DATA
-------
SQL databases cogcarsim2_2017.db & cogcarsim2_2019.db contain the CogCarSim log data of 18 subjects, 9 from 2017 and 9 from 2019. background_2017.csv & background_2019.csv contain original profile data on the 18 subjects. background_cogcarsim_2017.csv & background_cogcarsim_2019.csv contain cleaned-up, mutually compatible profile data on the 18 subjects. fss_data_2017.csv & fss_data_2019.csv contain Flow Short Scale self-report data on the 18 subjects. fss_learning.csv combines them and adds variables on learning derived from models fitted to data from the SQL database files. This file is generated by the accompanying R code cogcarsim_analyses.R.
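If you want to inspect the log databases directly, a sketch in Python, assuming the .db files are SQLite databases (the extension suggests it, but this is an assumption; verify against the repository):

```python
# Sketch: list the tables in one of the CogCarSim log databases.
# Assumes cogcarsim2_2017.db is an SQLite file.
import sqlite3

con = sqlite3.connect("cogcarsim2_2017.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"
).fetchall()
print(tables)
con.close()
```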
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was gathered from the Coffee Quality Institute (CQI) in May 2023. I've tweaked the script from the original author; see the repository for more details.
I've provided 5 CSV files, which include data from January 2018. If you would like to practice SQL, I recommend joining and wrangling the normalised tables. Otherwise, use the full table to begin exploratory data analysis.
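If you'd rather stay in Python, the same join practice works with pandas; a sketch in which the file names and the join key are hypothetical placeholders for the actual normalised tables:

```python
# Sketch: join two of the normalised coffee tables with pandas.
# File names and the "farm_id" key are hypothetical; adjust to the real CSVs.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # hypothetical table
farms = pd.read_csv("farms.csv")      # hypothetical table

merged = ratings.merge(farms, on="farm_id", how="left")
print(merged.head())
```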
The prepared longitudinal IntermediaPlus dataset 2014 to 2016 is "big data", which is why the full dataset will only be available in the form of a database (MySQL). In this database, the information from a respondent's different variables is organized in one column, one value below the other. The present publication includes an SQL database with the metadata of a sample of the full database, which covers a subset of the available variables of the total dataset and is intended to show the structure of the prepared data, together with the data documentation (codebook) of the sample. For this purpose, the sample contains all variables on sociodemography, free-time activities, additional information on a respondent and their household, as well as the interview-specific variables and weights. Only the variables concerning the respondent's media use are a small selection: for online media use, the variables of all overall offerings as well as the individual offerings of the genres politics and digital were included. The media use of radio, print and TV was not included in the sample because its structure can be traced using the published longitudinal data of the media analyses MA Radio, MA Pressemedien and MA Intermedia.
Due to the size of the data file, the database with the actual survey data would already be in the critical file-size range for common upload and download. The actual survey results required for analysis will be published in 2021 as the full Longitudinal IntermediaPlus (2014-2016) dataset in the GESIS DBK.
The data as well as their preparation are a proposed best-practice case for big-data management and the handling of big data in the social sciences and with social science data. The documentation and transparency of the harmonization work are provided using the GESIS software CharmStats, which was extended with big-data features within this project. A Python script and an HTML template were used to further automate the workflow with and within CharmStats.
The full Longitudinal IntermediaPlus dataset for 2014 to 2016 will be published in 2021 in cooperation with GESIS and made available in accordance with the FAIR principles (Wilkinson et al. 2016). By harmonizing and pooling the cross-sectional datasets into one longitudinal dataset, which is being carried out by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project "Audience and Market Fragmentation online", the aim is to make this media-analysis data source accessible for research on social and media change in the Federal Republic of Germany.
The future study number of the full Longitudinal IntermediaPlus (2014-2016) dataset in the GESIS DBK will be ZA5769 (Version 1-0-0), with DOI: https://dx.doi.org/10.4232/1.13530