Nikita Volodarskiy

How I See the IT Job Market in 2026

Nikita Volodarskiy — Wed, 13 May 2026 11:02:19 GMT

Before we begin, let me set up some constraints – my experience is based on:

Germany, first half of 2026
BI/Data/Business Analyst & Data Engineer positions
Personal feelings about the topic; individual experiences may vary

With that being said, let’s proceed.

From what my friends working in IT tell me, it is very hard to hire people for the positions mentioned above. One shared that they had been searching for 6 months without any result, despite interviewing candidates regularly. And this seems to be a fairly common case.

So what’s happening? I think there are a lot of underqualified people in IT looking for a job. Thousands of applications per position. The recruiting team can only interview a handful, a drop in the ocean.

This is why I believe cold applications do not work: there will be 999 other people who engineered their CV to match the job description. There will be zero difference between you and them, and no reason to prioritize you even if you are qualified. In the end, the recruiting team interviews unqualified candidates instead of you. Sorry, that’s maths.

This reminds me of Game Theory and the concept of Signaling Games, where you need to send some kind of signal so that the receiver sees you as a qualified candidate, while underqualified candidates cannot send this signal because it costs too much for them to produce.

Now let’s think about what those signals can be:

University
Online course certificates
Official certifications
Previous experience
Network

University can be a somewhat difficult thing to get through, and can be a nice additional signal if it matches the position.

Courses and certifications really show nothing at all. At best they show that the person studied and passed an exam. At worst they show that the person clicked through videos, skipped through materials, did homework with AI, and memorized answers to certification questions. Of course some effort is required, but hey! They are also motivated to get a job.

Previous experience – there’s a catch. Write anything you want on your LinkedIn, come up with nice stories about what you did. However, having been an interviewer for 3 years, I can say that in most cases you can identify made-up experience through questions about specific cases. Just dig a bit deeper. Ask about implementation, architecture – things that someone who was actually there would know, even if they weren’t responsible for it. The person who made it up doesn’t have that experience. So again, made-up experience increases the load on the recruiting team, because it may even require 2 interviews to decline a candidate.

Network – I found this to be the best signal that helped me get interviews. And I would even say the only instrument that worked.

The key thing I found: it should be your real network, not some random person who works at a company, whom you reached out to on LinkedIn and who is okay to refer you.

When people actually know you, the recruiting team will reach out to them and they will provide valuable feedback. Don’t underestimate it! If the person doesn’t know you, they will just say “We met on LinkedIn, their profile looks relevant.” That’s it – doesn’t sound encouraging. While your ex-colleague, or someone you know well from a different company, will be able to tell a real story about you.

This is a very good signal – it weighs more than all the previous ones. I don’t have access to applicant scoring models, but this is how I see it: when you apply cold, you are 1 of 1000; when you are referred, you can be the only one for the position. You get into a separate queue entirely. Now it depends only on you and how good you really are.

So how do you build a real network? You don’t build it when you need a job – build it before. Have good relationships with your colleagues. Go to meetups, contribute to discussions, help people with their problems, stay in touch with former colleagues. The network is the side effect of genuinely engaging with your professional community over time. You can’t fake it, which is exactly what makes it a strong signal.

The junior problem

This whole framework breaks down for juniors. They don’t have a network yet and lack relevant experience. They may have university degrees, certificates, pet projects – and that’s it. Which makes it extremely hard for a talented new graduate to send a strong signal to a hiring company.

The result of all this, unfortunately, is that the recruiting team still gets overloaded with thousands of irrelevant candidates. The junior hiring problem is mostly unsolved, and I don’t see it getting better soon (especially given that AI doesn’t make it any better).

Where does this leave us

The job market for data roles right now rewards one thing above everything else: people who know you. Not your certificates, not your LinkedIn profile, not even your CV. People who can say something real about you.

If you are mid-level or senior, invest in your network before you need it. If you are a junior – I am sorry, I really have no idea what to say here. But I am confident that cold applications are a lottery with the odds not in your favor.

Deduplication: Spark vs Athena (Trino)

Nikita Volodarskiy — Wed, 29 Apr 2026 18:54:57 GMT

We had a table in Athena that grew enormously to 16 billion records, where only 6 billion records were unique. This happened because a join condition was incorrectly populating duplicates over a long period of time. After we fixed the underlying condition, we also needed to deduplicate the existing data to restore query performance.

For context: it’s a Hive external table without partitions, stored as Parquet files on S3.

Athena

We tried a couple of approaches to delete duplicates directly in Athena.

Row number window function

The first approach was the standard one – use ROW_NUMBER() to rank rows per ID and keep only the latest:

SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_modified DESC) AS rn
  FROM data_lake.customer
)
WHERE rn = 1

It failed immediately with Query exhausted resources. Trino has limited ability to spill to disk for window functions, and in practice large window partitions frequently exhaust per-query memory.

Join-based approach

A known workaround in Athena is to replace the window function with a self-join, which Trino can handle more efficiently:

WITH grouped_max AS (
  SELECT id, MAX(last_modified) AS max_modified
  FROM data_lake.customer
  GROUP BY id
)
SELECT a.*
FROM data_lake.customer a
JOIN grouped_max b ON a.id = b.id AND a.last_modified = b.max_modified

This approach works well when duplicates differ by last_modified. In our case though, we had complete duplicates – rows where every column, including last_modified, was identical. The join would match all of them and return duplicates unchanged.
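
To make the failure mode concrete, here is a small pure-Python sketch that emulates the join on a handful of made-up rows. The sample data and the join_dedup helper are illustrative assumptions, not our production code:

```python
# Hypothetical sample rows: (id, last_modified, name).
rows = [
    (1, "2026-01-01", "alice"),
    (1, "2026-02-01", "alice"),  # newer version of id=1
    (2, "2026-03-01", "bob"),
    (2, "2026-03-01", "bob"),    # complete duplicate of the row above
]


def join_dedup(rows):
    """Emulate joining the table to (id, MAX(last_modified)) grouped by id."""
    max_modified = {}
    for row_id, modified, _name in rows:
        if row_id not in max_modified or modified > max_modified[row_id]:
            max_modified[row_id] = modified
    # Inner join back to the table on (id, last_modified).
    return [r for r in rows if r[1] == max_modified[r[0]]]


result = join_dedup(rows)
for row in result:
    print(row)
```

The older version of id=1 is removed, but both copies of id=2 survive, because each of them matches the maximum last_modified for that id.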

We also tried SELECT DISTINCT and GROUP BY with MAX, but both hit the same Query exhausted resources wall. At 16 billion rows, Athena simply doesn’t have enough per-query memory to materialise the intermediate state any of these approaches require.

So all Athena approaches were dead ends.

Spark

We decided to run the deduplication in PySpark in local mode on a dedicated Kubernetes pod (16 vCPU, 120 GB RAM). We don’t have a Spark cluster, so local mode on a large pod was the pragmatic choice – no cluster overhead, no executor coordination, just one JVM with all available memory.

The obvious approach, read the entire table and run a window function, also fails here for the same fundamental reason: shuffling 16 billion rows requires materialising all of them on a single machine, which no reasonable pod size can support.

Hash-partitioned batches

The solution was to split the work into 32 independent buckets using a deterministic hash function:

pmod(xxhash64(id), 32) == bucket  # bucket in 0..31

xxhash64 returns a signed 64-bit integer. pmod (positive modulo, as opposed to the % remainder operator) always produces a value in [0, 31] regardless of the sign of the hash. This is important – with the regular % operator, negative hash values produce negative remainders that never match any bucket, silently dropping roughly half of all IDs. pmod eliminates that.
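
The difference is easy to see on a few signed values. The sketch below is plain Python, not Spark: Python's own % already returns a non-negative result for a positive divisor, so the sign-following remainder is emulated explicitly. Both helper names are made up for this illustration:

```python
def truncated_rem(a: int, n: int) -> int:
    """Remainder that takes the sign of the dividend (Java-style %)."""
    r = abs(a) % n
    return -r if a < 0 else r


def pmod(a: int, n: int) -> int:
    """Positive modulo: always in [0, n) for n > 0."""
    return (truncated_rem(a, n) + n) % n


NUM_BUCKETS = 32
hashes = [-70, -33, -1, 0, 5, 37, 64]  # stand-ins for signed hash values

# Values whose remainder falls outside 0..31 would never match any bucket filter.
dropped = [h for h in hashes if not 0 <= truncated_rem(h, NUM_BUCKETS) < NUM_BUCKETS]
covered = [h for h in hashes if 0 <= pmod(h, NUM_BUCKETS) < NUM_BUCKETS]

print("dropped with %:", dropped)
print("covered with pmod:", covered)
```

Every negative value that is not an exact multiple of 32 lands outside 0..31 with the sign-following remainder, while pmod assigns all seven values to a valid bucket.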

Because the hash is deterministic, all rows for a given id always land in exactly one bucket. This guarantees deduplication is complete – no ID can be split across jobs.
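
This property can be checked on toy data. The sketch below is pure Python and uses zlib.crc32 as a stand-in for xxhash64 (an assumption made only to keep it dependency-free). It deduplicates each bucket independently and compares the union with a single global deduplication:

```python
import zlib

NUM_BUCKETS = 4  # the post uses 32; smaller here to keep the example readable


def bucket_of(row_id: str) -> int:
    # Stand-in for pmod(xxhash64(id), 32): crc32 is deterministic and unsigned.
    return zlib.crc32(row_id.encode("utf-8")) % NUM_BUCKETS


def dedup(rows):
    """Keep one row per id, preferring the greatest last_modified."""
    latest = {}
    for row in rows:
        row_id, modified = row
        if row_id not in latest or modified > latest[row_id][1]:
            latest[row_id] = row
    return set(latest.values())


# Hypothetical rows: (id, last_modified). ("b", 5) is a complete duplicate.
rows = [("a", 1), ("a", 2), ("b", 5), ("b", 5), ("c", 3), ("d", 1), ("d", 9)]

per_bucket = set()
for bucket in range(NUM_BUCKETS):
    bucket_rows = [r for r in rows if bucket_of(r[0]) == bucket]
    per_bucket |= dedup(bucket_rows)

print(sorted(per_bucket))
print(per_bucket == dedup(rows))
```

Since every id maps to exactly one bucket, the per-bucket results never overlap and their union equals the global result.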

Each of the 32 buckets ran as an independent job on a separate Kubernetes pod:

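from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Session configuration (local mode, memory settings) is omitted here
spark = SparkSession.builder.getOrCreate()

# bucket (0..31), source_path and destination_path are set per job
bucket_filter = F.pmod(F.xxhash64(F.col("id")), F.lit(32)) == bucket

# Rank rows per id, newest first; complete duplicates tie and one is kept
window = Window.partitionBy("id").orderBy(F.col("last_modified").desc())
deduped = (
    spark.read.parquet(source_path)
    .filter(bucket_filter)  # keep only this job's 1/32 slice of ids
    .withColumn("rn", F.row_number().over(window))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
deduped.write.mode("append").parquet(destination_path)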

Up to 8 buckets ran in parallel, each writing to the same output prefix in append mode.
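
How the jobs get scheduled depends on your platform. As a rough sketch of the "32 buckets, at most 8 at a time" pattern, the following uses a thread pool. The run_bucket function is a hypothetical placeholder for whatever actually launches a job and waits for it (a pod, a spark-submit call, and so on):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 32
MAX_PARALLEL = 8


def run_bucket(bucket: int) -> int:
    """Hypothetical placeholder: launch the job for one bucket and wait for it."""
    # A real implementation would start the PySpark job here and block until done.
    return bucket


def run_all(num_buckets: int = NUM_BUCKETS, max_parallel: int = MAX_PARALLEL):
    # The pool size caps how many buckets are in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in input order, regardless of completion order.
        return list(pool.map(run_bucket, range(num_buckets)))


completed = run_all()
print(len(completed), "buckets completed")
```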

Why didn’t scanning the full table cause OOM?

Each job reads the entire 16B-row table to find its 1/32 slice. You’d expect that to cause memory problems – but it doesn’t.

Spark reads Parquet in columnar batches of 4096 rows by default. Within each batch, whole-stage code generation fuses the scan and filter into a single compiled code path: rows that fail the predicate are skipped inline and never passed to the next operator as intermediate objects.

Rows that pass the filter are buffered in memory. Periodically, Spark checks whether the task can still acquire memory from the execution pool. When it can’t, the buffer is sorted and spilled to a temporary file on local disk, freeing memory for the next batch. At the end of the scan, all spill files and any remaining buffer are merged into the final shuffle output file.

This is why memory doesn’t grow with table size: the read buffer is 4096 rows (by default), and the write buffer is kept bounded by the execution memory pool through incremental spills – not by the number of rows scanned. Disk accumulates the ~500M surviving rows for this bucket throughout the scan. Only once all map-side tasks are complete does the stage barrier release and the second phase begin.
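
The buffer-and-spill idea can be illustrated with a toy external sort in plain Python. This is a heavy simplification and an assumption-laden sketch: Spark works on serialized binary rows and spills when the execution memory pool is exhausted, whereas this version spills after a fixed number of rows. The function names are made up:

```python
import heapq
import pickle
import tempfile


def _read_run(f):
    """Yield the rows of one sorted spill file, then close it."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            f.close()
            return


def external_sort(rows, max_in_memory):
    """Sort an iterable while buffering at most max_in_memory rows at a time."""
    spill_files = []
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and spill it to a temporary file.
            buffer.sort()
            f = tempfile.TemporaryFile()
            for item in buffer:
                pickle.dump(item, f)
            f.seek(0)
            spill_files.append(f)
            buffer = []
    # Merge all sorted spill files with whatever is left in memory.
    buffer.sort()
    runs = [_read_run(f) for f in spill_files] + [iter(buffer)]
    yield from heapq.merge(*runs)


data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
sorted_rows = list(external_sort(data, max_in_memory=3))
print(sorted_rows)
```

Memory use is governed by max_in_memory rather than by the length of the input, which is the same reason the scan phase stays bounded regardless of table size.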

The two phases have different resource profiles:

Scan + shuffle write: memory bounded by execution pool; disk accumulates ~500M rows
Sort + window: reads the shuffle files and sorts rows by id and last_modified; the window operator then evaluates row-by-row over the sorted partitions

In our case, the ~500M rows per bucket fit comfortably within the 175 Gi ephemeral volume, with room for sort spill. A single job on all 16B rows would produce ~32x more shuffle data on disk, well beyond 175 Gi, before the sort phase even starts.

Phase 1 – Scan + shuffle write:
  S3 read -> 4096-row batches (vectorized reader)
      -> filter: 31/32 rows dropped via WSCG inline code
      -> survivors -> ExternalSorter (in-memory buffer)
      -> buffer spills to disk when execution pool exhausted
      -> repeat for all 16B rows
      -> result: ~500M rows in shuffle files on local disk (~125 GB)

Phase 2 – Sort + window:
  [stage barrier: waits for all Phase 1 tasks to complete]
  read shuffle files -> SortExec (UnsafeExternalSorter, spills if needed)
      -> WindowExec: row-by-row over pre-sorted partitions
      -> ROW_NUMBER, keep rn=1
      -> write deduplicated rows to S3
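
The disk numbers can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes rows are spread evenly across buckets, that shuffle size scales linearly with row count, and that "Gi" means GiB. None of this was measured beyond the figures already quoted:

```python
TOTAL_ROWS = 16_000_000_000
NUM_BUCKETS = 32
SHUFFLE_GB_PER_BUCKET = 125   # approximate figure from the diagram
EPHEMERAL_VOLUME_GIB = 175

rows_per_bucket = TOTAL_ROWS // NUM_BUCKETS
bytes_per_row = SHUFFLE_GB_PER_BUCKET * 10**9 / rows_per_bucket
volume_gb = EPHEMERAL_VOLUME_GIB * 2**30 / 10**9
full_table_shuffle_gb = SHUFFLE_GB_PER_BUCKET * NUM_BUCKETS

print(f"rows per bucket:       {rows_per_bucket:,}")
print(f"shuffle bytes per row: {bytes_per_row:.0f}")
print(f"volume size:           {volume_gb:.1f} GB")
print(f"headroom per bucket:   {volume_gb - SHUFFLE_GB_PER_BUCKET:.1f} GB")
print(f"single-job shuffle:    {full_table_shuffle_gb:,} GB")
```

Under these assumptions a single unbucketed job would need roughly 4 TB of local shuffle space, more than twenty times the size of the volume.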

Could the batching approach work in Athena?

The truth is, it may. After ruling out the straightforward approaches and learning more about how Athena handles spilling to disk, we decided Spark would be a better tool here. Too much data? No problem – just spill to disk and wait 2 hours for deduplication to happen.

But if you still want to stay in Athena, you could try applying the same bucket filter there:

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_modified DESC) AS rn
  FROM data_lake.customer
  WHERE abs(from_big_endian_64(xxhash64(to_utf8(cast(id AS varchar))))) % 32 = 0
)
WHERE rn = 1

Run it 32 times, once per bucket, or even try increasing the number of buckets.
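
One way to produce the per-bucket statements is to generate them from a template. This is a sketch only: bucket_queries is a made-up helper, and submitting the statements to Athena is left out:

```python
QUERY_TEMPLATE = """\
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_modified DESC) AS rn
  FROM data_lake.customer
  WHERE abs(from_big_endian_64(xxhash64(to_utf8(cast(id AS varchar))))) % {num_buckets} = {bucket}
)
WHERE rn = 1"""


def bucket_queries(num_buckets=32):
    """Return one deduplication query per bucket, for buckets 0..num_buckets-1."""
    return [
        QUERY_TEMPLATE.format(num_buckets=num_buckets, bucket=bucket)
        for bucket in range(num_buckets)
    ]


queries = bucket_queries()
print(len(queries))
print(queries[31])
```

Calling bucket_queries(64) gives the same statements with twice as many, smaller slices.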

The unknowns are whether Athena’s limited window function spill support holds up under that load, whether shared worker memory leaves enough headroom for your bucket size, and whether you stay within the 30-minute query timeout (though this one can be increased up to 240 minutes).

It might just work. Give it a try and see.

Elasticsearch Federated Query via Athena Connector

Nikita Volodarskiy — Fri, 24 Apr 2026 15:36:49 GMT

When you need Elasticsearch data in Athena for analytics, the Athena Federated Query connector looks like the obvious first choice – no pipelines, no exports, just SQL on top of live ES data. This post covers what we ran into when we tried it, so you don’t have to spend the time on a PoC yourself.

Context

Our Elasticsearch cluster has a handful of indexes that the analytics team needs to query in Athena. The smallest is ~50 MB; the others average around 13 GB each. ES is optimized for search, not analytics, and doesn’t integrate with Athena natively – so the federated connector is the most direct path to bridge that gap.

What Is Athena Federated Query

Athena Federated Query lets you run SQL against external data sources without moving the data first. AWS maintains a set of Lambda-backed connectors for common sources: DynamoDB, RDS, Redis, and others. You deploy the connector as a Lambda function, register it as a data catalog in Athena, and then query it alongside your S3-based tables using standard SQL.

For Elasticsearch and OpenSearch, AWS provides the Amazon Athena connector for Elasticsearch/OpenSearch, available through the Serverless Application Repository (SAR).

Setup

The connector is deployed from SAR with a handful of parameters:

ES endpoint – the cluster URL
Secret ARN – credentials stored in AWS Secrets Manager (username/password or API key)
Spill bucket and prefix – S3 location used when query result sets exceed Lambda memory limits
Lambda function name – becomes your data catalog identifier in Athena

Once deployed, you register the Lambda as a data source in the Athena console under Data sources → Connect data source → Lambda. After that, the ES indexes appear as databases and tables within Athena – no schema definition required, since the connector infers it from ES mappings (but there is a twist, wait for it).
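
The registration step can also be scripted instead of clicked through. The sketch below uses the boto3 Athena client's create_data_catalog call. It is not what we ran: the catalog name and Lambda ARN are hypothetical placeholders, and the actual API call needs boto3 and AWS credentials:

```python
def lambda_catalog_request(catalog_name, lambda_arn):
    """Build the arguments for registering a Lambda-backed Athena data catalog."""
    return {
        "Name": catalog_name,
        "Type": "LAMBDA",
        "Description": "Elasticsearch connector (example)",
        # A single Lambda handles both metadata and record requests.
        "Parameters": {"function": lambda_arn},
    }


def register_catalog(request):
    """Send the request to Athena; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    return boto3.client("athena").create_data_catalog(**request)


# Hypothetical values, for illustration only.
request = lambda_catalog_request(
    "es_connector",
    "arn:aws:lambda:eu-central-1:123456789012:function:es-connector",
)
print(request)
# register_catalog(request)  # uncomment to actually create the catalog
```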

Two Connector Variants

Two SAR applications exist for this connector:

AthenaElasticsearchConnector – the standard variant. Connects to ES using the endpoint and credentials you provide. Straightforward to deploy and configure.

AthenaElasticsearchConnectorWithGlueConnection – intended for setups where network connectivity is managed via a Glue Connection. This one could not be made to work. The likely cause is a known limitation documented by AWS:

Due to a known issue, the OpenSearch connector cannot be used with a VPC.

Since we connect to ES through a VPC, this variant is a dead end. All further testing used the standard connector.

Problems

Permissions

A minor friction point, but worth calling out upfront. The IAM permissions required to deploy the connector and register the Athena catalog are spread across a few services – Lambda, Glue, Athena, S3, Secrets Manager, and IAM itself. Before starting, review the permissions required to create a connector and Athena catalog to avoid deploy failures mid-setup.

Schema Inference Errors

Here is the twist!

The connector infers the table schema directly from ES mappings at query time. This breaks when a field has inconsistent types across documents. For example, a field that is a single struct in some documents and an array in others.

The critical detail: the error triggers even when the problematic field is not selected in the query. The connector deserializes the full document before applying column projection, so any type inconsistency anywhere in the index will surface as a query error.
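
Before pointing the connector at an index, it can help to check a sample of documents for this kind of inconsistency. The sketch below is pure Python over already-fetched documents. The sample data and helper names are made up, it only inspects top-level fields, and fetching the sample from ES is not shown:

```python
from collections import defaultdict


def json_kind(value):
    """Classify a decoded JSON value by its structural type."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def inconsistent_fields(docs):
    """Return top-level fields that appear with more than one structural type."""
    kinds = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            if value is not None:  # nulls are compatible with any type
                kinds[field].add(json_kind(value))
    return {field: sorted(k) for field, k in kinds.items() if len(k) > 1}


# Hypothetical documents: "address" is a struct in one and an array in the other.
docs = [
    {"id": 1, "address": {"city": "Berlin"}},
    {"id": 2, "address": [{"city": "Munich"}, {"city": "Hamburg"}]},
]
problems = inconsistent_fields(docs)
print(problems)
```

Any field reported here is a candidate for breaking queries, whether or not it is selected.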

A partial workaround: create a Glue table manually with a schema that includes only the fields you need. The connector detects the Glue table and uses it as the schema, skipping deserialization of fields not listed. This works, but it adds ongoing maintenance overhead. Any new document with an inconsistent field on a column you do care about can still break queries unexpectedly.

Blank Values

One of the indexes returned the correct number of rows, but all field values were empty. The data was present in ES – the connector deserialized the documents without error but produced no values. We did not dig into the root cause because the next issue...

Performance

This was the deciding factor.

Index sizes and query times:

~50 MB (small): ~15 minutes
~13 GB (can we call it large?): no result after 30 minutes

Athena Federated Query works by pushing the query down to the Lambda connector, which fetches data from ES in batches, spills large intermediate results to S3, and returns them to Athena for final aggregation. For large indexes, the sheer volume of data flowing through this path is the bottleneck – ES is not built for full-scan analytics queries, and the connector does not change that.

For a 50 MB index, 15 minutes is already unusable for an analytics workload. For the 13 GB indexes, the query did not complete within 30 minutes, and we decided not to wait longer.
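
For a rough sense of scale, the small-index result can be extrapolated linearly. This is an assumption, not a measurement: part of the 15 minutes is probably fixed overhead, so the real number could be lower, but it gives an order of magnitude:

```python
small_index_mb = 50
small_index_minutes = 15
large_index_gb = 13

# Effective throughput observed on the small index.
throughput_mb_per_s = small_index_mb / (small_index_minutes * 60)

# Naive linear projection for one 13 GB index (1 GB taken as 1000 MB).
projected_hours = large_index_gb * 1000 / throughput_mb_per_s / 3600

print(f"throughput: {throughput_mb_per_s:.3f} MB/s")
print(f"projected:  {projected_hours:.0f} hours")
```

At roughly 0.06 MB/s, a single 13 GB index projects to about 65 hours, far beyond any Athena query timeout.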

Assessment

The federated connector is the right idea: live data, no pipeline to maintain, minimal setup. In practice, it ran into too many failures for our use case:

VPC incompatibility ruled out one of the two connector variants entirely
Schema inference errors required a manual Glue schema workaround, with fragile ongoing behavior
One index returned blank results with no clear path to debug
Query performance was 15 minutes on a 50 MB index and didn’t finish on 13 GB indexes

The connector may be viable for small, schema-consistent indexes where query latency requirements are loose. For anything larger or more complex, the performance characteristics make it unsuitable for production analytics use.