18:40
JSON Home Improvement at SCALE 14x
The slides from my presentation, JSON Home Improvement at SCALE 14x, are now available.
13:53
More as an example than anything else, I wanted a function that would take two JSONB objects in PostgreSQL, and return how the left-hand side differs from the right-hand side. This means any key that is in the left but not in the right would be returned, along with any key whose value on the left is different from the right.
Here’s a quick example of how to do this in a single SELECT. In real life, you probably want more error checking, but it shows how nice the built-in primitives are:
CREATE OR REPLACE FUNCTION json_diff(l JSONB, r JSONB) RETURNS JSONB AS
$json_diff$
SELECT jsonb_object_agg(a.key, a.value) FROM
( SELECT key, value FROM jsonb_each(l) ) a LEFT OUTER JOIN
( SELECT key, value FROM jsonb_each(r) ) b ON a.key = b.key
WHERE a.value != b.value OR b.key IS NULL;
$json_diff$
LANGUAGE sql;
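For example (the literals here are made up, just to show the shape of the result):
SELECT json_diff('{"a": 1, "b": 2}'::jsonb, '{"b": 2, "c": 3}'::jsonb);
-- returns {"a": 1}: "a" exists only on the left, "b" matches, and the right-only "c" is ignored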
15:19
I gave a talk earlier this week about the new features in PostgreSQL 9.5. The headline feature, of course, is the new INSERT ... ON CONFLICT clause, aka “Upsert,” which allows an INSERT that has a key conflict to do an UPDATE (or just ignore the insert) instead.
A perfectly reasonable question was: “Does this work on COPY?” And the answer is no, COPY doesn’t have an ON CONFLICT clause. But the answer is actually yes, because you can get these semantics without too much trouble using things you already have around the house.
First, note that INSERT can take a SELECT statement as the source of the records to import:
INSERT INTO t(pk, col1, col2)
SELECT pk, col1, col2 FROM t2
ON CONFLICT (pk) DO UPDATE
SET col1 = EXCLUDED.col1,
col2 = EXCLUDED.col2;
So, of course, you could COPY the import file into a secondary table like t2 above, and then do the INSERT.
But we don’t have to! We can use the file_fdw foreign data wrapper to “mount” the text file just like it was a local table.
First, we need to create the file_fdw extension in the database (it is part of the PostgreSQL distribution):
CREATE EXTENSION file_fdw;
… and a “server” to go with it:
CREATE SERVER import_t FOREIGN DATA WRAPPER file_fdw;
… and then “mount” the table in the local database:
CREATE FOREIGN TABLE t2 (
pk uuid,
col1 integer,
col2 text
) SERVER import_t
OPTIONS ( filename '/path/to/file/t.csv', format 'csv' );
And then we can do the INSERT operation just as above.
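Putting it all together, here’s a minimal sketch; the definition of the target table t is an assumption, just for illustration (ON CONFLICT needs a unique constraint on pk, which the primary key provides):
CREATE TABLE t (
    pk uuid PRIMARY KEY,
    col1 integer,
    col2 text
);

INSERT INTO t (pk, col1, col2)
SELECT pk, col1, col2 FROM t2
ON CONFLICT (pk) DO UPDATE
    SET col1 = EXCLUDED.col1,
        col2 = EXCLUDED.col2;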
So, we have proper MERGE-type operations in PostgreSQL, right out of the box!
21:53
How much and what to log in PostgreSQL is something that doesn’t really admit a single solution. Logging every connection can swamp a server, as can too low a setting of log_min_duration_statement. But there are two settings I always turn on: log_lock_waits and log_temp_files (with log_temp_files set to 0).
log_lock_waits will log any lock wait that goes on longer than the deadlock_timeout setting (the same process that checks for deadlocks also emits the log message). By default, this is one second, and if my database has a lock on which a process is waiting for that long, I want to know about it.
log_temp_files sets a size threshold above which each temp file creation is logged. Zero logs everything. Temp files are (almost always) bad; any time the system needs to do something on disk instead of in memory, I want to know about it.
So, just turn these on, and feed the results through pgbadger. If you are getting so many lock waits or temp files that you are getting excessive logging, the problem is the waits or the temp files, not the logging!
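For reference, here’s a minimal sketch of the relevant postgresql.conf lines, plus a typical pgbadger invocation (the log path is just an example):
# postgresql.conf
log_lock_waits = on      # log any lock wait longer than deadlock_timeout
log_temp_files = 0       # log every temp file, regardless of size
#deadlock_timeout = 1s   # the default; also the lock-wait logging threshold

# then feed the logs to pgbadger
pgbadger /var/log/postgresql/postgresql-*.log -o report.html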
19:39
I’ll be presenting at the January San Francisco PostgreSQL Users’ Group meeting on what’s new in PostgreSQL 9.5. I hope you can join us!
19:45
At the point that there are two separate warnings advising you to turn off a configuration parameter in postgresql.conf, it’s probably a good idea to take the advice and disable it.
In theory, this parameter sets a maximum amount of data that will flow over an SSL connection before key renegotiation, to prevent an eavesdropping attacker from determining the session key through collection of a large amount of ciphertext. In practice, it just causes broken connections and miscellaneous problems. Turn it off, especially in situations where you have funky networking and long-standing SSL connections (such as between a primary and secondary).
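For what it’s worth, the parameter being described here is, I believe, ssl_renegotiation_limit; disabling it is a one-liner in postgresql.conf:
# postgresql.conf
ssl_renegotiation_limit = 0   # 0 disables SSL renegotiation entirely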
09:29
(First in a series of things I always do when setting up or configuring PostgreSQL.)
Since version 9.3, PostgreSQL has had the option to create new database clusters with data checksums. These are fast, simple checksums to verify that the data coming off of disk is what PostgreSQL expects it to be. The checksums don’t correct errors, but they can warn you that something bad has happened to your data.
Always initialize new clusters with them turned on, unless you are running on a file system (like ZFS) that does checksumming itself. You have uncorrected errors on your disk, so you might as well find them.
Turning them on is the --data-checksums (-k) option to initdb. If you are using Debian packaging, you can set this in the /etc/postgresql-common/createcluster.conf file.
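A quick sketch of both approaches (the data directory path is just an example, and the createcluster.conf option name is from memory, so double-check it):
initdb --data-checksums -D /var/lib/postgresql/9.5/main

# /etc/postgresql-common/createcluster.conf (Debian/Ubuntu packaging)
initdb_options = '--data-checksums'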
10:51
I’ll be speaking about “The PCI-Compliant Database” at PGConf Silicon Valley!
09:35
You can’t build a real-life system without caching.
That being said, it’s often the case that parts of the system you think are going to be slow aren’t. I’ve noticed a tendency to build out a huge stack of components (“we’ll have PostgreSQL, and Redis, and Celery, and Varnish, and…”) without actually measuring where the bottlenecks are.
Example: A counter.
Suppose you need a global counter for something. It needs to be super-fast, and available across all of the web front ends. It’s not transactional (you never “uncount” based on a rollback), but you do want it to be persistent.
Option 1: Drop Redis into the stack, use INCR to keep the counter, and have some other process that reads the counter and spills it into PostgreSQL, then have some other process that picks up the count when Redis starts and initializes it (or be smart enough to read from both places and add them when you need it), and accept that there are windows in which you might lose counts.
Option 2: Use a sequence in PostgreSQL.
But option 2 is really really really slow compared to super-ultra-fast Redis, right?
Not really (tested on an Amazon i2.2xlarge instance, client over local sockets, Python client):
INCRs in Redis: 824 seconds.
SELECT nextval('') in PostgreSQL: 892 seconds.
So: Slower. 6.8 microseconds per increment slower. And no elaborate Redis tooling.
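For completeness, the PostgreSQL-only version is just a sequence; a minimal sketch (the sequence name is made up):
CREATE SEQUENCE hit_counter;

-- increment and read in one call; sequence increments are never rolled back,
-- which is exactly the behavior we want for this counter
SELECT nextval('hit_counter');

-- peek at the current value without incrementing
SELECT last_value FROM hit_counter;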
So, build for operation, apply load, then decide what to cache. Human intuition about what might be slow is almost certainly wrong.
09:23
Digital Ocean, who I assume are very nice people and meant well, did a Hacktoberfest event in which people were encouraged to submit a pull request to any open source GitHub project. In exchange, “contributors” would get a t-shirt.
You can probably guess what happened:
sigmavirus24: Hey @digitalocean your October pull request event has done nothing but wreak havoc on projects I maintain. I’ve had to reject too many PRs
Reviewing pull requests is a lot of work. Getting buried under trivial pull requests by people who are only after a t-shirt is a great way to get people to take projects private. Please don’t use open source projects as free marketing.