Ec.europa.eu



UK AnalysisHuw Fulcher (ONS - UK): huw.fulcher@.ukIntroduction?This analysis focuses on four questions:What is the distribution of vacancies over contract type and working hours?How do contract type and working hours explain Consumer Confidence (CCI) and Gross Domestic Product (GDP)?How does the overall count of job vacancies explain Consumer Confidence and Gross Domestic Product?How does the ONS vacancy figures compare to CEDEFOP?The hope is that a measure can be created from contract type and working hours that might be used in helping estimate economic statistics. This notebook contains the initial analysis towards this end. A review of findings can be found in the?Conclusion?which will include suggested actions to explore this avenue further. Unfortunately due to personal and professional constraints my analysis has not been as thorough as I wished.Please note that for certain cells the code has been folded to allow for easier viewing of visualisations. If you would like to inspect the code just click the arrows next to the code cells to expand the code.SetupThis section can be ignored if only looking at analysis. This just contains imports and functions to perform the analysis.In?[1]:from pyathenajdbc import connectimport pandas as pdimport matplotlib.pyplot as pltimport matplotlib.dates as mdatesimport numpy as npplt.style.use('ggplot')%matplotlib inlineIn?[2]:# creating the connectionconn = connect( access_key='AKIAJTT6VBHWL5AWOS6Q', secret_key='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', region_name='eu-central-1', schema_name='default', s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')In?[3]:def abline(x, y): """Plot a line from slope and intercept""" slope, intercept = np.polyfit(x, y, 1) x_vals = np.array((min(x.values), max(x.values))) y_vals = intercept + slope * x_vals return x_vals, y_valsIn?[4]:def get_oecd_gdp(): oecd_gdp = pd.read_csv("oecd_gdp.csv", delimiter="\t") oecd_gdp = oecd_gdp[oecd_gdp["LOCATION"] != "OECD"] oecd_map = { "AUT": "AT", "BEL": "BE", "DEU": "DE", "ESP": "ES", "FRA": "FR", "GBR": "UK", "ITA": "IT", "NLD": "NL", "POL": "PL", "SWE": "SE" } oecd_gdp = oecd_gdp.replace({"LOCATION": oecd_map}) oecd_gdp.rename(columns={"LOCATION": "sourcecountry", "TIME": "date", "Value": "GDP"}, inplace=True) return oecd_gdpIn?[5]:def match_oecd_gdp(data): oecd_gdp = get_oecd_gdp() oecd_trim = oecd_gdp[oecd_gdp["date"].isin(data["date"])] data_oecd = pd.merge(data, oecd_trim, on=["sourcecountry", "date"]) return data_oecdIn?[6]:def get_oecd_cci(): oecd_cci = pd.read_csv("oecd_cci.csv") oecd_cci = oecd_cci[oecd_cci["LOCATION"] != "OECD"] oecd_map = { "AUT": "AT", "BEL": "BE", "DEU": "DE", "ESP": "ES", "FRA": "FR", "GBR": "UK", "ITA": "IT", "NLD": "NL", "POL": "PL", "SWE": "SE" } oecd_cci = oecd_cci.replace({"LOCATION": oecd_map}) oecd_cci.rename(columns={"LOCATION": "sourcecountry", "TIME": "date", "Value": "CCI"}, inplace=True) oecd_cci["date"] = pd.to_datetime(oecd_cci["date"]) return oecd_cciIn?[7]:def match_oecd_cci(data): oecd_cci = get_oecd_cci() oecd_trim = oecd_cci[oecd_cci["date"].isin(data["date"])] data_oecd = pd.merge(data, oecd_trim, on=["sourcecountry", "date"]) return data_oecdIn?[8]:def convert_types(data): data["year_grab_date"] = data["year_grab_date"].astype(str) data["month_grab_date"] = data["month_grab_date"].astype(str) data["sourcecountry"] = data["sourcecountry"].astype(str) return dataIn?[9]:def convert_dates(data): data["date"] = data["year_grab_date"] + "-" + data["month_grab_date"] + "-1" data["date"] = pd.to_datetime(data["date"]) data.drop(["year_grab_date", "month_grab_date"], inplace=True, axis=1) return dataIn?[10]:def process_contract(data): data = convert_types(data) data["permenant_count"] = data["permenant_count"].astype("int64") data["temporary_count"] = data["temporary_count"].astype("int64") data["combined_count"] = data["permenant_count"] + data["temporary_count"] data = convert_dates(data) return dataIn?[11]:def process_working(data): data = convert_types(data) data["full_time_count"] = data["full_time_count"].astype("int64") data["part_time_count"] = data["part_time_count"].astype("int64") data["combined_count"] = data["full_time_count"] + data["part_time_count"] data = convert_dates(data) return dataIn?[12]:def process_overall(data): data = convert_types(data) data["total"] = data["total"].astype("int64") data = convert_dates(data) return dataIn?[13]:def process_uk(data): data = convert_types(data) data["cedefop_count"] = data["cedefop_count"].astype("int64") data["ons_adjusted_count"] = data["ons_adjusted_count"].astype("int64") data["ons_unadjusted_count"] = data["ons_unadjusted_count"].astype("int64") data = convert_dates(data) return data1. Contract Typea) What is the distribution of vacancies over contract type?OverviewIn?[14]:query_contract_count = """ SELECT contract, (count(distinct general_id)/ 1000000) AS count FROM cedefop_presentation.ft_document_essnet GROUP BY contract """contract_count = pd.read_sql(query_contract_count, conn)Unsurprisngly the majority of vacancies are composed of temporary/permenant jobs. Strangely there is the presence of jobs advertised for self employed posts. Some minor invesigation was made into this and found that a number of these jobs were coming from particular countries such as Italy. While no further investigation was made it is speculated that this might be due to misclassification by CEDEFOP or there is a difference in how self employed jobs are understood across the EU.Unfortunately there are a large number of vacancies that do not have an assigned contract type. Further work should be undertaken to see whether there is the possibility of creating some approach that could categorise these vacancies ourselves. For the purposes of this work only permenant and temporary vacancies are analysed in the rest of this section.In?[15]:ax = contract_count.plot(kind="bar", x="contract", y="count", figsize=(9,7))ax.set_title("Amount of vacancies by contract type")ax.set_xlabel("Contract Type")ax.set_ylabel("Vacancies (millions)");By countryIn?[16]:contract_month_full_query = """ SELECT year_grab_date, month_grab_date, sourcecountry, SUM(if(contract LIKE 'Permanent', 1, 0)) AS permenant_count, SUM(if(contract LIKE 'Temporary', 1, 0)) AS temporary_count FROM (SELECT DISTINCT general_id, year_grab_date, month_grab_date, sourcecountry, contract FROM cedefop_presentation.ft_document_essnet) GROUP BY year_grab_date, month_grab_date, sourcecountry ORDER BY year_grab_date, month_grab_date, sourcecountry """contract_month = pd.read_sql(contract_month_full_query, conn)In?[17]:contract_month = process_contract(contract_month.copy())contract_month.head(5)Out[17]:sourcecountrypermenant_counttemporary_countcombined_countdate092721393066018-07-011AE111122018-07-012AT140788278223562018-07-013BE5017728606787832018-07-014BG1319322018-07-01In?[18]:contract_month_sums = contract_month.groupby("sourcecountry", as_index=False).agg({"permenant_count": "sum", "temporary_count": "sum", "combined_count": "sum"})contract_month_sums.sort_values("combined_count", ascending=False, inplace=True)In?[19]:contract_month_sums_top = contract_month_sums.head(8)The most interesting observation from the below graph is the large number of temporary roles in countries such as Italy and Spain. Without much knowledge of these countries industries it would be hard to provide a concrete explanation as to why but it might be to do with seasonal jobs in industries such as farming and tourism. This would disrupt the hypothesis that an economy is more "stable" when there are more permenant jobs being advertised than temporary.There is a large difference in amount of vacancies even within the top 8 countries. Some thought will need to be given to how this is weighted if building a model around countries.In?[20]:ax = contract_month_sums_top.plot(kind="bar", x="sourcecountry", figsize=(10,8))ax.set_title("Number of Vacancies by Permenant/Temporary/Combined (Millions)")ax.set_xlabel("Country")ax.set_ylabel("Vacancies (millions)")ylabels = ['{:}'.format(x) for x in ax.get_yticks()/1000000]ax.set_yticklabels(ylabels);By monthIn?[21]:contract_month_countries = list(contract_month_sums_top["sourcecountry"])In?[22]:contract_month_top = contract_month[contract_month["sourcecountry"].isin(contract_month_countries)]In?[23]:contract_month_indexed_both = contract_month_top.set_index(["date", "sourcecountry"])contract_month_indexed_country = contract_month_top.set_index("sourcecountry")While there as some interesting trends in the two graphs below, particularly the large drop in vacancies in Feburary - March, not much can be gained from these visualisations as there is not enough monthly data. Once we have at least more than one years worth of data (ideally 3 years worth) we can begin to establish what patterns might exist.In?[24]:ax = contract_month_indexed_both.unstack(1)["permenant_count"].plot(figsize=(9,7))ax.set_title("Permenant vacancy count")ax.set_xlabel("Date")ax.set_ylabel("Vacancies")ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5));In?[25]:ax = contract_month_indexed_both.unstack(1)["temporary_count"].plot(figsize=(9,7))ax.set_title("Temporary vacancy count")ax.set_xlabel("Date")ax.set_ylabel("Vacancies")ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5));Observing the individual country trends below we can see the typical seasonal changes with peaks in vacancies before and after December. This type of seasonality is useful in the prediction of vacancies into the future but a suitable pattern needs to be discerned for the whole year.Some particular observations of note are as follows:Temporary and permenant vacancies become inverse in April 2019 for Spain. Looking at the 2018 summer data and assuming a similar pattern it would seem that this is the beginning of an increase of temporary summer jobs although we would need more data to confirm this.France has a large reduction in permenant vacancies after January.Permenant and Temporary vacancies in the UK and Germany follow patterns closely. This would indicate that a combined count of vacancies would work just as well as an indicator for these particular countries. The expectation was that different countries would naturally have different distributions of jobs which would make modelling economic indicators across countries more difficult.In?[26]:for country, dataframe in contract_month_indexed_country.groupby("sourcecountry"): ax = dataframe.plot(kind="line", x="date", figsize=(9,7)) ax.set_title(f"{country} job vacancies") ax.set_xlabel("Date") ax.set_ylabel("Vacancies") ax.legend(loc="center right", bbox_to_anchor=(1.35, 0.5))b) How does contract type explain CCI?In?[27]:contract_month_cci = match_oecd_cci(contract_month_top)In?[28]:contract_month_cci.head(5)Out[28]:sourcecountrypermenant_counttemporary_countcombined_countdateCCI0BE5017728606787832018-07-01100.475801DE2616643387586004222018-07-01101.351002ES29960706431006032018-07-01101.259203FR4403884308898712772018-07-0199.841674IT26395937351201302018-07-01100.97700In?[29]:contract_month_cci_indexed = contract_month_cci.set_index("date")In?[30]:contract_month_cci_indexed.head(5)Out[30]:sourcecountrypermenant_counttemporary_countcombined_countCCIdate2018-07-01BE501772860678783100.475802018-07-01DE261664338758600422101.351002018-07-01ES2996070643100603101.259202018-07-01FR44038843088987127799.841672018-07-01IT2639593735120130100.97700Looking across all countries there is very little evidence that CCI correlates with contract type. In certain instances it could be argued that CCI follows the general trend of vacancies (e.g. NL and UK) although it is unclear whether CCI is being explained by vacancy count or vice versa. As stated before a better understanding of the interaction between contract type and CCI would be obtained from having more data.In?[31]:for country, dataframe in contract_month_cci_indexed.groupby("sourcecountry"): correlation_perm = dataframe["permenant_count"].corr(dataframe["CCI"]) correlation_temp = dataframe["temporary_count"].corr(dataframe["CCI"]) correlation_combined = dataframe["combined_count"].corr(dataframe["CCI"]) fig, ax1 = plt.subplots(figsize=(12,6)) ax2 = ax1.twinx() color = "tab:red" ax1.set_title(f"{country}, Correlation Permenant: {correlation_perm:.2f}, Correlation Temporary: {correlation_temp:.2f}, Correlation Combined: {correlation_combined:.2f}") ax1.set_xlabel("Date") ax1.set_ylabel("Vacancies") p1 = [] for frame, label in zip(["permenant_count", "temporary_count", "combined_count"], ["Permenant", "Temporary", "Combined"]): p1.append(ax1.plot(dataframe[frame], label=label)) p1 = [p[0] for p in p1] ax1.tick_params(axis='y') color = "tab:pink" ax2.set_ylabel("CCI, Long-term average=100") p2 = ax2.plot(dataframe["CCI"], color=color, label="CCI") ax2.tick_params(axis='y', labelcolor=color) p = p1 + p2 labs = [l.get_label() for l in p] ax1.legend(p, labs, loc="center right", bbox_to_anchor=(1.45, 0.5)) fig.tight_layout()c) How does contract type explain GDP?In?[32]:contract_year_gdp = contract_month_top.groupby([contract_month["date"].dt.year, "sourcecountry"]).agg({"permenant_count": "sum", "temporary_count": "sum", "combined_count": "sum"})contract_year_gdp = contract_year_gdp.reset_index(level=[0,1])In?[33]:contract_year_gdp.head(5)Out[33]:datesourcecountrypermenant_counttemporary_countcombined_count02018BE37188322124159312412018DE20887322445991453472322018ES15662439501455163832018FR4321065331462665252742018IT16952758233151858In?[34]:contract_year_gdp = contract_year_gdp[contract_year_gdp["date"].isin(["2018"])]contract_2018_oecd_gdp = match_oecd_gdp(contract_year_gdp)contract_2018_oecd_gdp = contract_2018_oecd_gdp.drop("date", axis=1)In?[35]:contract_2018_oecd_gdp.head(5)Out[35]:sourcecountrypermenant_counttemporary_countcombined_countGDP0BE3718832212415931245.752916e+051DE2088732244599145347234.456151e+062ES1566243950145516381.865002e+063FR4321065233146266525273.037360e+064IT1695275823317518582.515781e+06While the correlation scores look promising upon inspection of the scatter plots it seems that there is not a particularly strong correlation between contract type and GDP. At best temporary vacancies seem to capture the general trend but would not necessarily be suitable for estimation. When?compared against the total vacancies correlation-How-does-vacancies-overall-explain-GDP?) it seems that by filtering by contract type we are just losing information that would provide more accurate predictions.In?[36]:titles = ["Permenant Vacancies", "Temporary Vacancies", "Combined Vacancies"]variables = ["permenant_count", "temporary_count", "combined_count"]for title, variable in zip(titles, variables): correlation = contract_2018_oecd_gdp["GDP"].corr(contract_2018_oecd_gdp[variable]) fig, ax1 = plt.subplots(figsize=(7,7)) ax1.scatter(contract_2018_oecd_gdp["GDP"], contract_2018_oecd_gdp[variable], s=100) ax1.set_title(f"{title} - Correlation: {correlation:.2f}") ax1.set_xlabel("GDP") ax1.set_ylabel("Vacancies (Millions)") ylabels = ["{:}".format(x) for x in ax1.get_yticks()/1000000] x, y = abline(contract_2018_oecd_gdp["GDP"], contract_2018_oecd_gdp[variable]) ax1.plot(x, y, '--') for i, country in enumerate(contract_2018_oecd_gdp["sourcecountry"]): ax1.annotate(country, (contract_2018_oecd_gdp["GDP"].iat[i]+30000, contract_2018_oecd_gdp[variable].iat[i]+30000)) ax1.grid()2. Working Hoursa) What is the distribution of vacancies over working hours?OverviewIn?[37]:query_working_count = """ SELECT working_hours, (count(distinct general_id)/ 1000000) AS count FROM cedefop_presentation.ft_document_essnet GROUP BY working_hours """working_count = pd.read_sql(query_working_count, conn)As with Contract Type there are a large amount of vacancies that do not have Working Hours assigned. This should feed into a larger piece of imputation work for the whole dataset.In?[38]:ax = working_count.plot(kind="bar", x="working_hours", y="count")ax.set_title("Amount of vacancies by working hours")ax.set_xlabel("Working Hours")ax.set_ylabel("Vacancies (Millions)");By countryIn?[39]:working_month_full_query = """ SELECT year_grab_date, month_grab_date, sourcecountry, SUM(if(working_hours LIKE 'Full time', 1, 0)) AS full_time_count, SUM(if(working_hours LIKE 'Part time', 1, 0)) AS part_time_count FROM (SELECT DISTINCT general_id, year_grab_date, month_grab_date, sourcecountry, working_hours FROM cedefop_presentation.ft_document_essnet) GROUP BY year_grab_date, month_grab_date, sourcecountry ORDER BY year_grab_date, month_grab_date, sourcecountry """working_month = pd.read_sql(working_month_full_query, conn)In?[40]:working_month = process_working(working_month.copy())In?[41]:working_month_sums = working_month.groupby("sourcecountry", as_index=False).agg({"full_time_count": "sum", "part_time_count": "sum", "combined_count": "sum"})working_month_sums.head(5)Out[41]:Source-countryfull_time_countpart_time_countcombined_count02894111740111AE5310401193212AT4902923853838756753BA1344BE14175414021951819736In?[42]:working_month_sums.sort_values("combined_count", ascending=False, inplace=True)working_month_sums_top = working_month_sums.head(8)Looking at the split between Full Time and Part Time it's clear that there is an emphasis on Full Time vacancies, particularly in DE, BE and IT. Depending on the breakdown over time this might provide an indicator of the economy. If the proportion of Full Time to Part Time changes will this mean that the economy is altered? The only country of the top 8 that has more Part Time than Full Time vacancies is Sweden, this might be an avenue of research to take in future.In?[43]:ax = working_month_sums_top.plot(kind="bar", x="sourcecountry", figsize=(9,7))ax.set_title("Number of Vacancies by Full Time/Part Time/Combined (Millions)")ax.set_xlabel("Country")ax.set_ylabel("Vacancies (Millions)")ylabels = ['{}'.format(x) for x in ax.get_yticks()/1000000]ax.set_yticklabels(ylabels);By monthIn?[44]:working_month_countries = list(working_month_sums_top["sourcecountry"])In?[45]:working_month_top = working_month[working_month["sourcecountry"].isin(working_month_countries)]In?[46]:working_month_indexed_both = working_month_top.set_index(["date", "sourcecountry"])working_month_indexed_country = working_month_top.set_index("sourcecountry")Note: The higher of the two red lines is the UK, the lower is BE. Unfortunatel Matplotlib does not support more than 8 colours as standard.Looking at the two graphs below there is clear evidence again of seasonal changes with a much more pronounced effect for Part Time jobs. This can be attributed to pre-Christmas hiring for seasonal staff although this does not explain the large increase in January. This might be for additional staff to help the retail industry get back to normal operations, this is only an uneducated guess. The increase in Full Time vacancies post Christmas can be most likely attributed to the lack of advertising around Christmas due to the expected low number of applicants.In?[47]:ax = working_month_indexed_both.unstack(1)["full_time_count"].plot(figsize=(10,6))ax.set_title("Full time vacancy count")ax.set_xlabel("Date")ax.set_ylabel("Vacancies (Millions)")ylabels = ['{:}'.format(x) for x in ax.get_yticks()/1000000]ax.set_yticklabels(ylabels)ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5));In?[48]:ax = working_month_indexed_both.unstack(1)["part_time_count"].plot(figsize=(10,6))ax.set_title("Part time vacancy count")ax.set_xlabel("Date")ax.set_ylabel("Vacancies")ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5));Observing individual countries shows again the Christmas seasonal effect but also highlights seasonal changes unique to certain countries. For example Sweden has a large jump in vacancies in the March to April period before trailing back off.Additionally it can be seen that much of the combined vacancy count is driven by Full Time vacancies. How much of this is due to the disproportion between Full Time and Part Time has not been investigated.What is clear from these plots is that there is a challenge to be faced in attempting to produce a generalised model for these vacancies, as with Contract Type.In?[49]:for country, dataframe in working_month_indexed_country.groupby("sourcecountry"): ax = dataframe.plot(kind="line", x="date", figsize=(10,6)) ax.set_title(f"{country} job vacancies") ax.set_xlabel("Date") ax.set_ylabel("Vacancies") ax.legend(loc="center right", bbox_to_anchor=(1.3, 0.5))b) How does working hours explain CCI?In?[50]:working_month_cci = match_oecd_cci(working_month_top)In?[51]:working_month_cci_indexed = working_month_cci.set_index("date")In?[52]:working_month_cci_indexed.head(5)Out[52]:Source countryfull_time_countpart_time_countcombined_countCCIdate2018-07-01BE9950222920122422100.475802018-07-01DE822458153807976265101.351002018-07-01ES7541739594115011101.259202018-07-01FR30853333332164185499.841672018-07-01IT10472629397134123100.97700From just a cursory look at these figures it is clear to see that there is little interaction between CCI and Working Hours.In?[53]:for country, dataframe in working_month_cci_indexed.groupby("sourcecountry"): correlation_full = dataframe["full_time_count"].corr(dataframe["CCI"]) correlation_part = dataframe["part_time_count"].corr(dataframe["CCI"]) correlation_combined = dataframe["combined_count"].corr(dataframe["CCI"]) fig, ax1 = plt.subplots(figsize=(12,6)) ax2 = ax1.twinx() color = "tab:red" ax1.set_title(f"{country}, Correlation Full-Time: {correlation_full:.2f}, Correlation Part-Time: {correlation_part:.2f}, Correlation Combined: {correlation_combined:.2f}") ax1.set_xlabel("Date") ax1.set_ylabel("Vacancies") p1 = [] for frame, label in zip(["full_time_count", "part_time_count", "combined_count"], ["Full Time", "Part Time", "Combined"]): p1.append(ax1.plot(dataframe[frame], label=label)) p1 = [p[0] for p in p1] ax1.tick_params(axis='y') color = "tab:pink" ax2.set_ylabel("CCI, Long-term average=100") p2 = ax2.plot(dataframe["CCI"], color=color) ax2.tick_params(axis='y', labelcolor=color) p = p1 + p2 labs = [l.get_label() for l in p] ax1.legend(p, labs, loc="center right", bbox_to_anchor=(1.45, 0.5)) fig.tight_layout()c) How does working hours explain GDP?In?[54]:working_year_gdp = working_month_top.groupby([contract_month["date"].dt.year, "sourcecountry"]).agg({"full_time_count": "sum", "part_time_count": "sum", "combined_count": "sum"})working_year_gdp = working_year_gdp.reset_index(level=[0,1])In?[55]:working_year_gdp = working_year_gdp[working_year_gdp["date"].isin(["2018"])]working_2018_oecd_gdp = match_oecd_gdp(working_year_gdp)working_2018_oecd_gdp = working_2018_oecd_gdp.drop("date", axis=1)In?[56]:working_2018_oecd_gdp.head(5)Out[56]:Source countryfull_time_countpart_time_countcombined_countGDP0BE6289161980708269865.752916e+051DE6248932118453174334634.456151e+062ES3543622490326033941.865002e+063FR1980036210074740807833.037360e+064IT6607691822818430502.515781e+06Compared to Contract Type it seems that Full Time vacancies provide at least a general indicator of GDP. Part Time vacancies might perform better if the type categories are rebalanced correctly. Work should be undertaken to understand how much of the explanation of GDP is related to just the general number of vacancies and how much the filtering by Working Hours has an effect.In?[57]:titles = ["Full Time Vacancies", "Part Time Vacancies", "Combined Vacancies"]variables = ["full_time_count", "part_time_count", "combined_count"]for title, variable in zip(titles, variables): correlation = working_2018_oecd_gdp["GDP"].corr(working_2018_oecd_gdp[variable]) fig, ax1 = plt.subplots(figsize=(7,7)) ax1.scatter(working_2018_oecd_gdp["GDP"], working_2018_oecd_gdp[variable], s=100) ax1.set_title(f"{title} - Correlation: {correlation:.2f}") ax1.set_xlabel("GDP") ax1.set_ylabel("Vacancies (Millions)") ylabels = ["{:}".format(x) for x in ax1.get_yticks()/1000000] ax1.set_yticklabels(ylabels) x, y = abline(working_2018_oecd_gdp["GDP"], working_2018_oecd_gdp[variable]) ax1.plot(x, y, '--') for i, country in enumerate(working_2018_oecd_gdp["sourcecountry"]): ax1.annotate(country, (working_2018_oecd_gdp["GDP"].iat[i]+30000, working_2018_oecd_gdp[variable].iat[i]+30000)) ax1.grid()3. Overall countsa) What is the distribution of vacancies overall?By countryIn?[58]:overall_month_full_query = """ SELECT year_grab_date, month_grab_date, sourcecountry, COUNT(DISTINCT general_id) as total FROM cedefop_presentation.ft_document_essnet GROUP BY year_grab_date, month_grab_date, sourcecountry ORDER BY year_grab_date, month_grab_date, sourcecountry """overall_month = pd.read_sql(overall_month_full_query, conn)In?[59]:overall_month = process_overall(overall_month.copy())In?[60]:overall_month_sums = overall_month.groupby("sourcecountry", as_index=False).agg({"total": "sum"})overall_month_sums.head(5)Out[60]:sourcecountrytotal050991AE117492AT18507473BA74BE2488290In?[61]:overall_month_sums.sort_values("total", ascending=False, inplace=True)overall_month_sums_top = overall_month_sums.head(8)In?[62]:ax = overall_month_sums_top.plot(kind="bar", x="sourcecountry", figsize=(9,7))ax.set_title("Amount of vacancies by Total (Millions)")ax.set_xlabel("Country")ax.set_ylabel("Vacancies (millions)")ylabels = ["{:}".format(x) for x in ax.get_yticks()/1000000]ax.set_yticklabels(ylabels);By monthIn?[63]:overall_month_countries = list(overall_month_sums_top["sourcecountry"])In?[64]:overall_month_top = overall_month[overall_month["sourcecountry"].isin(overall_month_countries)]In?[65]:overall_month_indexed_both = overall_month_top.set_index(["date", "sourcecountry"])overall_month_indexed_country = overall_month_top.set_index("sourcecountry")In?[66]:ax = overall_month_indexed_both.unstack(1)["total"].plot(figsize=(10,6))ax.set_title("Total vacancy count")ax.set_xlabel("Date")ax.set_ylabel("Vacancies")ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5));While most countries follow a similar seasonal pattern it is worth noting that Spain and Poland have two very different trends to the rest. For Spain there is the standard end of year peak in vacancies but it is not followed by the second peak in January. For Poland there is a large increase in vacancies in Poland before dropping sharply until January when it peaks again. Poland is the most interesting of the two as there doesn't seem to be any indication of a pre-Christmas increase in vacancies, this should be investigated further.In?[67]:for country, dataframe in overall_month_indexed_country.groupby("sourcecountry"): ax = dataframe.plot(kind="line", x="date", figsize=(10,6)) ax.set_title(f"{country} job vacancies") ax.set_xlabel("Date") ax.set_ylabel("Vacancies") ax.legend(loc="center right", bbox_to_anchor=(1.2, 0.5))b) How does vacancies overall explain CCI?In?[68]:overall_month_cci = match_oecd_cci(overall_month_top)In?[69]:overall_month_cci_indexed = overall_month_cci.set_index("date")In?[70]:overall_month_cci_indexed.head(5)Out[70]:Source countrytotalCCIdate2018-07-01BE150580100.475802018-07-01DE1378215101.351002018-07-01ES159510101.259202018-07-01FR114899399.841672018-07-01IT222130100.97700As with other attempts to explain CCI there seems to be little relation with vacancy amount. The only exception to this seems to be France where the increase in vacancies correlates with a quick decline of consumer confidence. Without more analysis of other economic indicators I would be hesitant to say there is anything meaningful in this correlation.In?[71]:for country, dataframe in overall_month_cci_indexed.groupby("sourcecountry"): correlation = dataframe["total"].corr(dataframe["CCI"]) fig, ax1 = plt.subplots(figsize=(12,6)) ax2 = ax1.twinx() ax1.set_title(f"{country}, Correlation: {correlation:.2f}") ax1.set_xlabel("Date") ax1.set_ylabel("Vacancies") p1 = ax1.plot(dataframe[["total"]], label="Total") ax1.tick_params(axis='y') color = "tab:pink" ax2.set_ylabel("CCI, Long-term average=100") p2 = ax2.plot(dataframe["CCI"], color=color) ax2.tick_params(axis='y', labelcolor=color) p = p1 + p2 labs = [l.get_label() for l in p] ax1.legend(p, labs, loc="center right", bbox_to_anchor=(1.4, 0.5)) fig.tight_layout()c) How does vacancies overall explain GDP?In?[72]:overall_year_gdp = overall_month_top.groupby([overall_month["date"].dt.year, "sourcecountry"]).agg({"total": "sum"})overall_year_gdp = overall_year_gdp.reset_index(level=[0,1])In?[73]:overall_year_gdp = overall_year_gdp[overall_year_gdp["date"].isin(["2018"])]overall_2018_oecd_gdp = match_oecd_gdp(overall_year_gdp)overall_2018_oecd_gdp = overall_2018_oecd_gdp.drop("date", axis=1)In?[74]:overall_2018_oecd_gdp.head(5)Out[74]:Source countrytotalGDP0BE11095985.752916e+051DE103333314.456151e+062ES11244781.865002e+063FR84891183.037360e+064IT14061192.515781e+06Aligning with other work done by workpackage members it seems that vacancy count serves as a good indicator of GDP.In?[75]:fig, ax1 = plt.subplots(figsize=(7,7))correlation = overall_2018_oecd_gdp["GDP"].corr(overall_2018_oecd_gdp["total"])ax1.scatter(overall_2018_oecd_gdp["GDP"], overall_2018_oecd_gdp["total"], s=100)ax1.set_title(f"Total Vacancies - Correlation: {correlation:.2f}")ax1.set_xlabel("GDP")ax1.set_ylabel("Vacancies (Millions)")ylabels = ["{:}".format(x) for x in ax1.get_yticks()/1000000]ax1.set_yticklabels(ylabels)x, y = abline(overall_2018_oecd_gdp["GDP"], overall_2018_oecd_gdp["total"])ax1.plot(x, y, '--')for i, country in enumerate(overall_2018_oecd_gdp["sourcecountry"]): ax1.annotate(country, (overall_2018_oecd_gdp["GDP"].iat[i]+30000, overall_2018_oecd_gdp["total"].iat[i]+30000))ax1.grid()4. ONS vs CEDEFOPIn?[76]:uk_month_full_query = """ SELECT year_grab_date, month_grab_date, COUNT(distinct general_id) as cedefop_count FROM cedefop_presentation.ft_document_essnet WHERE sourcecountry = 'UK' GROUP BY year_grab_date, month_grab_date ORDER BY year_grab_date, month_grab_date """uk_month = pd.read_sql(uk_month_full_query, conn)In?[77]:uk_vacancies = pd.read_csv("uk_vacancies.csv")In?[78]:uk_merged = pd.merge(uk_month, uk_vacancies, on=["year_grab_date", "month_grab_date"])uk_merged["sourcecountry"] = "UK"uk_merged = process_uk(uk_merged)In?[79]:uk_merged_indexed = uk_merged.set_index("date")There is little, if any correlation between CEDEFOP count of vacancies and ONS' official count of vacancies (See second figure for a clearer view of variance in ONS counts). While I can only speculate at this stage, there are a number of reasons as to why this might be the case.While efforts are made to remove duplicates from the CEDEFOP data there will inherently be duplicates contained within due to the nature of the data collection. The CEDEFOP data is composed of Adzuna data and various job boards specific to certain countries. Adzuna's main process for collecting vacancies is to scrape these job boards for their vacancies, therefore by keeping the Adzuna data and job boards together there will be the duplication of vacancies. This is explained with the assumption that steps have not been taken to rectify this first, in the dataset's current state we cannot discern this ourselves. Without access to the job description it will be very difficult to perform fuzzy matching to filter out duplicates.The ONS Vacancy Survey methodology can be found?on the ONS website?but a simple explanation is provided here. The Vacancy Survey samples roughly 6,000 companies stratified by industry and size every month. While the methodology for this approach is proven robust there are improvements that could be made in relation to how certain industries are monitored. A particular example is the Technology sector where there are an ever growing number of small businesses that recruit often as turnover can be high. If the survey is not weighted correctly to account for this (I have no knowledge as to whether this is the case) then there may be undercounting of vacancies.Ultimately, until we can be confident of sufficient deduplication of the CEDEFOP data I am hesitant to say anything further on the difference between CEDEFOP counts and ONS counts.In?[80]:ax = uk_merged_indexed.plot(figsize=(9,7))ax.set_title("Comparison between CEDEFOP and ONS job vacancy statistics")ax.set_xlabel("Date")ax.set_ylabel("Vacancies")ax.set_ylim([0, 2500000])ax.legend(loc="center right", bbox_to_anchor=(1.35, 0.5));Note: Adjusted and Unadjusted relate to seasonal adjustmentIn?[84]:uk_merged_indexed["cedefop_count"].corr(uk_merged_indexed["ons_unadjusted_count"])Out[84]:0.2671440918727458In?[81]:cedefop_first = uk_merged_indexed.iloc[0]["cedefop_count"]ons_adjusted_count_first = uk_merged_indexed.iloc[0]["ons_adjusted_count"]ons_unadjusted_count_first = uk_merged_indexed.iloc[0]["ons_unadjusted_count"]In?[82]:uk_merged_indexed["cedefop_proportion"] = uk_merged_indexed["cedefop_count"] / cedefop_firstuk_merged_indexed["ons_adjusted_proportion"] = uk_merged_indexed["ons_adjusted_count"] / ons_adjusted_count_firstuk_merged_indexed["ons_unadjusted_proportion"] = uk_merged_indexed["ons_unadjusted_count"] / ons_unadjusted_count_firstIn?[83]:uk_merged_indexed[["cedefop_proportion", "ons_adjusted_proportion", "ons_unadjusted_proportion"]].plot(figsize=(9,7));ConclusionTo conclude, there does not seem to be a clear indication that Contract Type or Working Hours are a good predictor of CCI or GDP. While there is promising results of Working Hours and GDP there would need to be more work undertaken to the pre-processing of the CEDEFOP data to give more concrete conclusions.While this analysis may not have achieved a suitable indicator it has helped give a better understanding of the distribution of the data and provided further avenues for research. In particular a better understanding of individual countries trends (such as Spain and Poland) might provide opportunities for better indicators to be developed. One avenue which should be prioritized is the deduplication of the CEDEFOP data, until that is accomplished all results that are found need to be approached with caution. ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download