October 5th, 2011 By randy Categories: Security, System Administration

In my prior post I made the case against rotating password policies and suggested two-factor authentication as a policy that actually works. Two-factor authentication requires both something you know (a memorized password) and something you have (a physical item) to verify that you are who you say you are.

Two-factor authentication doesn’t have to be expensive. In fact, Google has developed a solution that uses a smartphone to generate time-based, one-time verification codes unique to each individual user and the machine they are logging in to. It’s called Google Authenticator.

Installing it on FreeBSD is simple, but it probably won’t behave the way you intend out of the box. I’ll walk you through installing it and tailoring it to your needs.

Step 1: Install libqrencode. This is optional, but you’ll be glad you did when a QR code appears right in your SSH terminal, ready to be scanned by your smartphone. It doesn’t have to be installed first; you can install it after Google Authenticator as well.

cd /usr/ports/graphics/libqrencode
make && make install

Step 2: Install Google Authenticator.

cd /usr/ports/security/pam_google_authenticator
make && make install

Step 3: Download the Google Authenticator app to your smartphone. Sorry, I can’t help you with this step. Search for it in the marketplace; it’s free.

Step 4: As the user you’d like to generate a key for, run the program (on your FreeBSD machine):

google-authenticator

Choose ‘y’ to update your .google_authenticator file and scan the QR code into your smartphone’s Google Authenticator app.

Step 5: Configure your OpenPAM config file. Let’s set it up to work for SSH. Using your favorite editor, add the following line to the bottom of your /etc/pam.d/sshd file:

auth     optional     /usr/local/lib/pam_google_authenticator.so

You’re “done!” If you’ve set up Google Authenticator for a particular account, SSH will now prompt you for the verification code that your smartphone generates.
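
One assumption worth noting: for sshd to actually present the verification-code prompt, PAM keyboard-interactive authentication has to be enabled. FreeBSD’s stock /etc/ssh/sshd_config generally has this already, but if you’ve customized it, check for the following and restart sshd after changing it:

UsePAM yes
ChallengeResponseAuthentication yes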

Problems:

You’ve noticed that we used “optional” in the PAM config file. This means the use of it is, well, optional. Not only can users who have not set up Google Authenticator yet still log in, but users who have set it up can skip it by leaving the verification code blank when prompted.

If you want to require Google Authenticator for your users you can change that to “required.” However, you then face another problem: users who haven’t set it up yet will not be able to log in.

The desired effect should be that users who have set it up are required to use it, but users who have not set it up can get away without it (at least during early deployment).

If you were using Linux-PAM, this would be easy: Linux-PAM allows conditional statements in the PAM config file that, upon matching, skip a specified number of the following modules. OpenPAM, used by FreeBSD, doesn’t have that feature to my knowledge. So, to get the intended behavior we can apply a simple patch to pam_google_authenticator.c.

Warning: I suspect Google didn’t do it this way for a reason: with this patch, setting the control flag in the PAM configuration to “sufficient” essentially opens a gaping security hole that allows anyone to log in, without a password or verification code, to any account where google-authenticator is not set up. You have been warned. With this patch, only use the “required” control flag in the PAM configuration.

Modify pam_google_authenticator.c by adding the following code just above the ‘// Clean up’ comment:

  if ((rc != PAM_SUCCESS) &&
      (secret_filename != NULL) &&
      (access(secret_filename, F_OK) == -1) &&
      (errno == ENOENT)) {
    log_message(LOG_ERR, pamh, "No config file found, skipping authentication");
    rc = PAM_SUCCESS;
  }

It’s pretty straightforward: if the secret file (by default .google_authenticator) doesn’t exist in the user’s home directory, the module returns PAM_SUCCESS, satisfying the two-factor authentication requirement for users who have not set it up yet.
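
With the patch in place, the sshd PAM entry from Step 5 can be tightened so that everyone who has enrolled must supply a code (same module path as before; adjust it if your port installs elsewhere):

auth     required     /usr/local/lib/pam_google_authenticator.so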

Enjoy. If you have trouble recompiling with the changes, post a comment.

September 30th, 2011 By randy Categories: Security

Policies that require users to change their password every couple of months do nothing to increase security. Instead, these policies say quite a bit about the technical philosophy and capabilities of the company or administrator(s) in charge. They say, “I’m a point-and-click administrator” and “I don’t understand security.” I’ll try to make the case that a rotating password policy does nothing to protect against the ways passwords are actually compromised.

Let’s look at a few of the ways in which passwords are often compromised.

  1. A hardware or software key-logger, virus or piece of spyware is harvesting passwords.
    In this case, the employee or user types a password on a compromised machine, whether at home, at a hotel, at an Internet cafe or on a friend or family member’s computer.
  2. The password is obtained by someone sniffing network traffic.
    This happens when a plaintext protocol is used.
  3. The server’s password list is compromised.
    You had an SQL injection vulnerability in your web app, or your password file was compromised, and someone managed to get the hashed list of passwords. If some of the passwords were simple and a basic hash function without salt was used, then some of them could be recovered with a precomputed hash table, a dictionary attack or brute force.
  4. The password was so simple that someone guessed it or brute forced it.
    This is rare, but most people who don’t understand security think this is what hacking looks like, thanks to the improper portrayal of hackers in movies.

There are plenty of other ways in which a password can be compromised, but these will suffice for now.

The first and only question that needs to be asked in order to debunk the rotating password policy is this:

Once someone’s password is compromised, how does changing it six months later stop the attacker from using it today, tomorrow or next week? Even if the attacker is selling password lists on the black market, every criminal knows a list older than a couple of days is nearly worthless.

A second point demonstrates the ridiculous nature of forcing people to change their passwords:

When you force people to change their password every couple of months and demand that it be some complicated combination of alphanumeric characters, you’re forcing them to write it down, and most untrained users will leave that written password sitting in their cubicle.

A better solution

Two-factor authentication is the way to go if you don’t mind inconveniencing people and want to enforce a serious password policy. It requires two things: a memorized password and a physical item that generates a one-time token unique to the user and to the time the user is logging in. Usually this is accomplished with a little device that goes on your keychain, but smartphone apps are now capable of providing the same thing. I installed Google Authenticator on one of my servers.

With two-factor authentication, even if someone obtains my password, they won’t be able to log in without the physical device I carry around in my pocket. Can two-factor authentication be broken? Yes: someone can use one of the methods above to steal my password, then hit me over the head with a hammer and take my device. Someone could also obtain the private key used to generate my one-time tokens, break the algorithm or even gain physical access to the server!

Conclusion

Security isn’t about making it impossible to break in. At the end of the day we can concoct a zillion scenarios where even the Pentagon could be overtaken. Security is about plausibility and probability. The plausible and probable methods of compromising a machine are not mitigated by a rotating password policy; they are mitigated by two-factor authentication. So it’s safe to say that anyone who enforces a rotating password policy probably didn’t think it through.

An added note: real security comes from educating users. I am writing from a perspective that believes most users will never be educated.

July 15th, 2011 By randy Categories: Rants

In 2008, Daniel Miessler provided a list of 25 questions to ask during an information security interview. I thought he did a good job, but wanted to emphasize something he noted that I think a lot of college students reading it might miss. With the economy in a slump, one would think that employers are at an advantage in finding new talent, and in a certain sense they are. However, companies looking for qualified system administrators and information security professionals are having a difficult time finding that talent. At the same time, computer science and information system majors are having a difficult time getting hired.

If you were one of the few students in college who had real interests within your technology-based major, you know exactly what the problem is. In fact, you’re not experiencing the same difficulty finding a job as everyone else; you’re experiencing a different kind of problem. Your problem is that the other people who work in your field are nincompoops.

All too often, college students spend their undergraduate career wasting time. They go to class, study for tests, do their homework and then, when they have free time, they waste it. Those who grew up with tiger moms apply it to a higher standard of academic rigor, and those brought up in the typical American home practice their much “deserved” rest and relaxation; neither is accomplishing anything. I am not discrediting the importance of academia, good grades or taking time to unwind. I am pointing at a distorted view of college as a guaranteed paycheck in four years. Studying for tests and doing homework is necessary and important. However, if getting a piece of paper to satisfy the requirements of a job is the end goal, you’re wasting your time and money. Your free time in college, as a student pursuing a technical major, should be spent pursuing those same interests and putting the information acquired in the academic setting to good use.

Career service centers in colleges often reinforce the distorted academic notion that good grades and a piece of paper equal guaranteed success. After all, job placement is how universities convince students to enroll. They encourage students to greater heights of academic rigor in order to increase their competitive advantage for entry-level positions. Then they teach techniques for acing interviews as if they were another academic hoop to jump through. There is a tendency to read Miessler’s post as an aid of this kind. In fact, the person who sent me the link was doing just that. He had no technical interests outside of class, is graduating soon and is now studying for a series of exams of another type: those conducted by interviewers. Job seekers, informed by a distorted view of academia, are substituting the study techniques they learned in college for practical, real-life interests.

If you’re looking for a job as an information security professional or a system administrator, it’s easy to trick most interviewers. Just practice writing scripts to parse log files (many will certainly ask you to do this), memorize the general details of the TCP/IP stack, be able to articulate how protocols like HTTP, DNS, TCP, UDP and ICMP work, and so on. However, be careful; I’ve encountered more than one person who could articulate how a traceroute worked with almost textbook accuracy, but hadn’t a clue where to begin, even aided by a man page, if asked to manually perform one using a tool such as ‘ping.’
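
For the curious, here is a minimal sketch of that exercise. It assumes FreeBSD’s ping, where -m sets the outgoing TTL and -t is a timeout in seconds (Linux flags differ), and example.com is just a placeholder. Each router that expires the TTL identifies itself with an ICMP “time exceeded” message, which is the whole trick behind traceroute:

#!/bin/sh
# Poor man's traceroute: one probe per TTL, note who answers.
for ttl in 1 2 3 4 5 6 7 8; do
  echo "TTL ${ttl}:"
  ping -c 1 -m ${ttl} -t 2 example.com 2>&1 | grep "bytes from"
done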

Miessler makes a distinction between the computer/technology/security enthusiast and the person just looking for a paycheck in his question concerning the home network. In fact, he isn’t writing for college students; he’s writing for interviewers. Here is where the college student should take his or her cue. Everyone I’ve met in college who is doing interesting things today was also doing interesting things, outside of the classroom, back then too. If you want to ace the interview in a couple of years, go join a club where like-minded individuals are exploring their interests. In my day, it was the Association for Computing Machinery (ACM) that provided this environment. Academia, and the theoretical foundation it lays, is important, but all the useful skills that make you competitive in the workplace will come from actually doing something.

April 7th, 2010 By randy Categories: System Administration

There may be situations when you want to throttle the amount of requests a specific user or IP address can make to your website. This works great if you are using Apache as a reverse proxy for security, availability or performance reasons.  Back in the Apache 1.x days there was a module called mod_dosevasive that did just the trick.  Unfortunately it did not work as well in Apache 2.x.

A much better solution is a module called mod_security. mod_security allows you to write sophisticated, stateful rules and take action based on particular conditions. Using mod_security you can do a lot more than DoS-evasive maneuvers: you can filter for XSS, SQL injection, mail header injection and lots more. It uses Perl regular expressions for the win.

In preliminary tests the filter does not block search engine spiders (at least not the ones that count).

If you are using FreeBSD ports, you’ll also need to change the default:

SecRuleEngine DetectionOnly

to:

SecRuleEngine On

and add:

SecDataDir /tmp

in /usr/local/etc/apache22/Includes/mod_security2/modsecurity_crs_10_config.conf

The following can be used in a VirtualHost directive, or included directly in httpd.conf. It can be easily tailored to suit your needs:

# Ignoring media files, count requests made in past 10 seconds.
SecRule REQUEST_BASENAME "!(css|doc|flv|gif|ico|jpg|js|png|swf|pdf)$" \
            "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1"
 
# This is where every other example online goes wrong.  We want the var to expire and leave it
# alone. If we combine this with the increments in the rule above, the timer never expires unless
# there are absolutely no requests for 10 seconds. 
SecRule ip:requests "@le 2" "phase:1,nolog,expirevar:ip.requests=10"
 
# if there were more than 20 requests in 10 seconds for this IP
# set var block to 1 (expires in 30 seconds) and increase var blocks by one (expires in 5 minutes)
SecRule ip:requests "@ge 20" "phase:1,pass,nolog,setvar:ip.block=1,expirevar:ip.block=30,setvar:ip.blocks=+1,setvar:ip.requests=0,expirevar:ip.blocks=300"
 
# If user was blocked more than 5 times (var blocks>5), log and return http 403
SecRule ip:blocks "@ge 5" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"
 
# if user is blocked (var block=1), log and return http 403
SecRule ip:block "@eq 1" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"
 
# 403 is some static page or message
ErrorDocument 403 "<html><body><h2>Too many requests.</h2></body></html>"

The above blocks clients that send more than 20 requests in a 10-second period. They will be blocked for 30 seconds unless this has been a frequent occurrence: if they were blocked more than five times within five minutes, they will be blocked for five minutes.
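
A quick way to sanity-check the rules from another machine (example.com stands in for your own vhost): fire off 30 rapid requests and watch the status codes flip from 200 to 403 once the threshold is crossed.

#!/bin/sh
# Hammer the site 30 times in quick succession and print each HTTP status code.
i=1
while [ $i -le 30 ]; do
  curl -s -o /dev/null -w "%{http_code}\n" http://example.com/
  i=$((i+1))
done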

February 15th, 2010 By randy Categories: System Administration

In part I we identified some of the typical culprits of a slow web application.

The art of optimization consists of identifying and diagnosing these bottlenecks without making assumptions. Because bottlenecks can exist at various locations, they often produce the same symptoms and can sometimes be interrelated. For example, two separate issues, a poorly configured server and a poorly designed application, can result in the same symptom: swapping under either heavy or light load. Why the swapping occurs determines what the solution is. The engineer must be able to take measurements, analyze the relevant system stats and make calculated, logical deductions.

In my experience it’s not uncommon for a client to think that they know what the problem is and expect help in implementing their preconceived solution. After all, they did spend the last couple years making and acting on false assumptions.

Perhaps the classic case of this was a company I spoke with that was experiencing growing pains. Traffic was increasing, they had 12 application servers, they still had massive bottlenecks, and they were looking for someone to help set up an infrastructure that would let them quickly deploy additional application servers. Their website didn’t do anything unique (none of them do), and the amount of traffic they were receiving was frankly irrelevant. They were convinced that more traffic means more servers. They had already attempted to outsource the serving of images to “the cloud,” which they admitted was a mistake. They were also in the process of moving full-text searches to Apache Solr. When I asked them what type of hardware they were running, they didn’t know! Neither did they have conclusive evidence about where the real bottlenecks occurred. They were using a co-managed solution and were paying their web hosting company monthly for each generic server, around 20 in total. Additionally, they were serving both public and private network traffic, including heavy NFS, MySQL and proxy traffic, over the same NICs and switches.

What shocked me about this company wasn’t how superstitious their developers were, but how certain they were that adding additional servers was the solution. Even with unprivileged access to the servers I knew there was a high likelihood that we could increase performance while decreasing the number of servers even prior to optimizing their application code and doing the work of a DBA. Analysis is not staring at the results of `tail -f /some-big/log-file` and chanting together “more traffic, bots and attackers oh my!”

While not all-inclusive, these are the typical steps and tools I use during the discovery phase when working with clients; a consolidated sketch of the commands follows the list:

  1. Rule out the network layer.
    • If it is the network: is it a bandwidth issue or a network configuration issue? Most of the time it is the latter. Serving private network data over the same NICs and switches can cause serious latency depending on the hardware.
  2. Rule out silly things like large files per page load.
    • One of the first things I do with clients is visit their website and get an understanding of how their application works. You’d be surprised how many websites are perceptually slower due to things like loading many images from many different servers, and so on. You’d think some of these things were common sense.
  3. Survey the load on application and database servers.
  4. Check I/O and Virtual Memory:
    • On a production application or database server, during a period of heavy load (artificial or real), I look at the output of `vmstat` and `iostat -dx`.
    • I’m not only looking for swap usage or heavy drive activity, but also for wasted resources, like a machine using only 3GB of its full 16GB of RAM, or vice versa. Sometimes we want RAM left free for the buffer cache; other times we want to dedicate it to database settings such as sort_buffer, query_cache, key_buffer, tmp_table_size and table_open_cache.
  5. Monitor Database (MySQL) Activity
    • Set up a slow query log.
    • `mysqladmin status`
      I’m quickly glancing at open tables vs. opens to see whether table caching is misconfigured or not configured at all, along with average queries per second, slow queries and so on.
    • Probing the size of databases and indexes:
    • mysql> SELECT count(*) TABLES, concat(round(sum(table_rows)/1000000,2),'M') rows, concat(round(sum(data_length)/(1024*1024*1024),2),'G') DATA, concat(round(sum(index_length)/(1024*1024*1024),2),'G') idx, concat(round(sum(data_length+index_length)/(1024*1024*1024),2),'G') total_size, round(sum(index_length)/sum(data_length),2) idxfrac FROM information_schema.TABLES;
      Which results in something like:
      +--------+-------+-------+-------+------------+---------+
      | TABLES | rows  | DATA  | idx   | total_size | idxfrac |
      +--------+-------+-------+-------+------------+---------+
      |   1146 | 3.75M | 0.64G | 0.07G | 0.70G      |    0.11 |
      +--------+-------+-------+-------+------------+---------+

      Or, for a breakdown of each specific database:

      mysql> SELECT count(*) TABLES, table_schema, concat(round(sum(table_rows)/1000000,2),'M') rows, concat(round(sum(data_length)/(1024*1024*1024),2),'G') DATA, concat(round(sum(index_length)/(1024*1024*1024),2),'G') idx, concat(round(sum(data_length+index_length)/(1024*1024*1024),2),'G') total_size, round(sum(index_length)/sum(data_length),2) idxfrac FROM information_schema.TABLES GROUP BY table_schema ORDER BY sum(data_length+index_length) DESC LIMIT 7;

    • `mysqladmin ext -ri10`
      This initially gives us some very valuable server stats and then repeats them every ten seconds, showing the change in values since the previous sample. Using this we can spot initial red flags and then watch what the MySQL server actually did over the last ten seconds to confirm their legitimacy. We’re looking for warning signs around Created_tmp_disk_tables, aborted_clients, admin_commands, incorrect use of the query cache, key sorts, misuse of thread_caching, open files, etc.
    • `mysqlreport` and `mysqltuner.pl` are also wonderful third-party tools that retrieve this information and do many of the calculations for you.
  6. I look at their backend code, data structures, algorithms and DB queries and see how they can be optimized.
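
As promised above, here is a consolidated sketch of that discovery pass. It assumes FreeBSD defaults and placeholder MySQL credentials (USER/PASSWORD); adjust paths and options to taste, and run it while the system is under load.

#!/bin/sh
# Snapshot of virtual memory and extended disk statistics; repeat (or add an interval) under load.
vmstat
iostat -dx
# Quick glance at MySQL: open tables vs. opens, queries per second, slow queries.
mysqladmin -uUSER -pPASSWORD status
# Extended status repeated every ten seconds, printing the change between samples.
mysqladmin -uUSER -pPASSWORD ext -ri10
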
February 10th, 2010 By randy Categories: @Home

Perhaps a future post will demonstrate the use of FreeBSD for wireless APs in a commercial environment with roaming. This post will demonstrate a basic home router setup.

Hardware:

  • My wireless card (ath0) is equipped with the Atheros chipset.
  • Ethernet Nic (re0) is connected to a cable modem.
  • Ethernet Nic (em0) is connected to a switch for wired internet access.

Network:

  • Internal NAT: 10.0.0.0/24
  • We’ll bridge (bridge0) em0 and ath0’s wlan device (wlan0).
  • ISC-DHCP31 will respond to DHCP requests.
  • Packet Filter (PF) will do our routing.

You will need to know what to replace with your own configuration (not much).

Step 1: Install & Configure ISC-DHCP31 Server

  1. `cd /usr/ports/net/isc-dhcp31-server`
  2. `make && make install`
  3. Add dhcpd_enable="YES" to your /etc/rc.conf file.
  4. My /usr/local/etc/dhcpd.conf looks like this (be sure to change the domain-name and any other custom settings):
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.2 10.0.0.254;
  option domain-name-servers 4.2.2.1;
  option domain-name "CANAAN";
  option routers 10.0.0.1;
  option broadcast-address 10.0.0.255;
  default-lease-time 600;
  max-lease-time 7200;
}

Step 2: Configure Network Settings

  1. Add the following to /etc/rc.conf
pf_enable="YES"
pf_rules="/etc/pf.conf"
gateway_enable="YES"
wlans_ath0="wlan0"
create_args_wlan0="wlanmode ap"
ifconfig_re0="dhcp"   #remember this is my cable modem, it gets an IP address via DHCP
cloned_interfaces="bridge0"
ifconfig_bridge0="addm wlan0 addm em0"
ipv4_addrs_bridge0="10.0.0.1/24"
ifconfig_em0="up"
ifconfig_wlan0="ssid chicken up"
hostname="CANAAN" #You'll want to change this.

Step 3: Configure Packet Filter

  1. Add the following to /etc/pf.conf
nat on re0 from 10.0.0.0/24 to any -> (re0)

REMEMBER: re0 is the ethernet device connected to my cable modem. Your setup WILL be different. Want to learn more about that Packet Filter rule? Here is an EXCELLENT tutorial: http://www.openbsd.org/faq/pf/nat.html

Done! Who would have thought it could be so simple?

You can either restart your computer or:

  1. `/etc/rc.d/netif restart`
  2. `sysctl net.inet.ip.forwarding=1`
  3. `/etc/rc.d/pf start`
  4. `/usr/local/etc/rc.d/isc-dhcpd start`
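
Before plugging in clients, a few quick read-only sanity checks are worth running (output will vary with your hardware):

ifconfig bridge0      # wlan0 and em0 should be listed as members
ifconfig wlan0        # should report "ssid chicken" and status: running
pfctl -s nat          # should show the NAT rule on re0
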
February 6th, 2010 By randy Categories: System Administration

In a previous post, “DNS and Offsite Failover”, I documented the implementation of an automated offsite failover using DNS. For static websites there is nothing left to do except perhaps use `rsync` to keep the files up to date.

Unfortunately there is a lot more to think about for dynamic websites and applications that make use of a database.

  1. The databases need to be synced in real time.
  2. Keeping track of state is important; each server needs to know when it and its counterparts are up, down, active or standing by.
  3. In the event of a failover, they need to coordinate cleanup, merge changes and start replicating again.

With the level of complexity involved, small businesses typically accept defeat and embrace the potential downtime. Some even plan for a manual failover in the event of a disaster.

There are two automated solutions that can be weighed in the balance.

The Reduced Functionality Offsite Failover
The easiest solution by far is to implement what I call the “read-only,” or reduced functionality, offsite failover. In this configuration we’ll keep another database synced offsite. In the event of downtime the offsite backup takes over, but in a reduced-functionality mode. If the site supports user logins or performs transactions, those features can be disabled temporarily. Interactive sites effectively become “read-only” for the time being.

This poses a problem for commercial sites whose business is to perform transactions. Potential sales are lost, but at least new and old customers can still learn about products and get information. Nobody receives an ugly “Server not found” browser error; in fact, a custom message can be crafted explaining that full functionality will return shortly. This works great for university sites, magazines, journals or even popular blogs where the primary purpose of the site is to deliver information.

This makes the life of the system administrator easy because the primary server doesn’t need to keep track of state, never needs to merge changes later on and can continue as it was when and if the network connection is restored.

In fact, if it’s acceptable for the offsite server to be a day behind the primary, implementation can be as simple as our prior DNS solution and a nightly cronjob that looks something like this:

#!/bin/sh
ssh USERNAME@primary.location.edu 'mysqldump --single-transaction -uUSERNAME -pPASSWORD -h DBSERVER DATABASE' | /usr/local/bin/mysql -uLOCALUSERNAME -pLOCALPASSWORD LOCALDATABASE
MYSQL='/usr/local/bin/mysql -uLOCALUSERNAME -pLOCALPASSWORD'
$MYSQL <<EOF
use LOCALDATABASE;
delete from shared_sessions;
delete from main_cache_page;
delete from groups_cache_page;
insert into main_access (mask, type, status) values ('%', 'user', 0);
insert into groups_access (mask, type, status) values ('%', 'user', 0);
EOF

The above is an example of a Drupal site being placed into “read-only” mode. The ENTIRE database is pulled from the primary location via SSH (a shared key is used instead of a password). The session and cache tables are cleared, and Drupal’s access tables are modified to block all but the admin user from logging in. If the database is large, it’s best to avoid this method and stick to replication, which will also give you real-time updates. We’ll cover that in the “fully functional implementation” in Part II. You can borrow elements from both while developing your own custom solution.
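
The corresponding crontab entry on the offsite server might look something like this (the script path is hypothetical; 2:30 AM is arbitrary):

30 2 * * * /usr/local/scripts/readonly_sync.sh > /dev/null 2>&1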

You can even set server variables in Apache:

SetEnv readonly yes

Then, in the theme layer, detect the server variable ‘readonly’ to know when to hide the login box or display your custom reduced-functionality notification, so that two separate code bases do not need to be maintained. A Drupal example:

  if ($region == 'login_slide' && !$_SERVER['readonly']) {    // Don't display the login field if the server is in read-only mode.
    drupal_set_content('login_slide', pframe_login_slide());
  }

Fully Functional Offsite Failover
There are few sites that can afford to deny customer transactions. Reduced functionality mode is the easiest solution to implement but isn’t very practical for most business models. It’s harder to implement a fully functional offsite failover. As stated earlier, this requires both machines to keep track of state and resolve data conflicts when the primary changes from down to up. The logic gets even more complicated when you take into account the transition back from secondary to primary. The delays introduced by DNS caching, and by ISPs not respecting TTLs, can lead visitors to both the primary and secondary locations at the same time and throw the databases out of sync. This method requires a lot more thought.

We’ll address the implementation of the fully functional offsite failover in Part II.

February 5th, 2010 By randy Categories: System Administration

The Internet was engineered with failover and high availability in mind. Using basic DNS concepts, it’s easy to implement an automatic offsite failover in the event of a network outage or disaster.

Setup:

  • Master DNS and web servers are located at our primary location.
  • Secondary DNS and web servers are located at an offsite location.

Logic:

  • If Internet connectivity is lost at the primary location both web and DNS servers will fail to respond to requests.
  • The secondary DNS server continually monitors these services and in the event of an outage modifies the slave’s DNS so that requests are directed to the offsite location.
  • When the Master comes back online DNS replication automatically kicks back in and reverts DNS to its original settings.

I have written the following script to monitor both the primary web and DNS servers for outages. When the number of servers still up falls below $limit, a search and replace is performed on the slave’s zone file. The reason I avoided a graceful reload via ‘rndc’ is so that I would not have to increment the slave’s serial number and thus throw the master and slave out of sync.

#!/usr/bin/perl -w
use strict;
# Randy Sofia 2009
# DNS Modifier
# Check listed servers to see if they are up. 
# If less than $limit are up then @searchreplace $zonefile.
 
#----------------[SETTINGS]----------------
my @webservers; 
push @webservers, ("10.0.0.30", "10.0.0.40");
 
my @dnsservers;
push @dnsservers, ("10.0.0.38", "10.0.0.22");
 
my $zonefile="/etc/namedb/slave/db.slave.edu";
 
my @searchreplace;
push @searchreplace, ('10.0.0.30', '10.10.0.20');
push @searchreplace, ('10.0.0.40', '10.10.0.20');
 
my $limit=1;    # Minimum amount of servers up 
#------------------------------------------
 
my $result;
my $totalservers=scalar(@webservers) + scalar(@dnsservers);
my $upcount=$totalservers;
 
foreach my $webserver (@webservers) {
        $result = `/usr/local/bin/wget -q -t1 -T2 -O - $webserver`;
        $upcount-- if (!$result)
}
 
foreach my $dnsserver (@dnsservers) {
        $result = `/sbin/ping -t1 -c1 $dnsserver`;
        $upcount-- if ($result =~ /.+0 packets received/) 
}
 
printf ("%d/%d servers up", $upcount, $totalservers);
 
&ChangeDNS($zonefile, @searchreplace) if ($upcount < $limit);
 
sub ChangeDNS  {
        my $filebuffer;
        my $file=shift;
        open READZONEFILE, $file or die $!;
 
        while (my $line = <READZONEFILE>) {
                for (my $i=0; $i<scalar(@_); $i=$i+2) {
                        $line =~ s/$_[$i]/$_[$i+1]/g;
                }
                $filebuffer .= $line;
        }
        close(READZONEFILE);
        open WRITEZONEFILE, ">", $file or die $!;
                print WRITEZONEFILE $filebuffer;
        close(WRITEZONEFILE);
        system("/etc/rc.d/named restart");
}

Set the TTL of your DNS records to a couple of minutes and set the script to run every minute or so in your crontab. I did not write this script with portability in mind, so you may have to modify the locations of your binaries and zone file; the paths are the FreeBSD defaults. You’ll also need to install wget. Master/slave DNS configuration is not covered here.
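
For example, the crontab entry on the slave might look like this (the script path and log file are hypothetical):

* * * * * /usr/local/scripts/dns_failover.pl >> /var/log/dns_failover.log 2>&1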

Warning: Do not attempt to implement this unless you understand the logic behind it and are capable of performing tests in a staging environment. Additional variables such as a tertiary offsite location require additional tweaks to the logic that make this solution work. We are available for consulting if in doubt.

February 4th, 2010 By randy Categories: System Administration

It is often interesting to hear the different responses and solutions to the problem of a sluggish website. Responses range from adding new hardware (RAM, I/O, CPU) and application servers, to jumping on the latest lightweight web server (Nginx, lighttpd) or caching solution (memcached, boost), all the way to “outsourcing” the serving of images to other services like Amazon’s cloud or full-text search to Apache Solr. While it is entirely possible these can contribute to speeding things up, more often than not they result in insignificant gains and a bigger system administration headache. Inexperienced system administrators and consultants stare at the output of `nload`, `tail -f` their log files and boast of the enormous number of hits they get from bots, users and “attackers”, when quite often these observations have little to do with any significant discovery phase.

Where do people get these ideas? We live in an age where hardware is abundant and cheap. The hardware age has trained us to stop being scientific in our discovery methods, and it has spoiled developers, who can now get away with writing inefficient code. Big numbers razzle-dazzle the inexperienced, who boast about the size of their server farms, how many hits they get and their use of enterprise tools to keep things in check. Management sets hard deadlines and respects the art of optimization far less than it respects getting things done quickly. After all, why be scientific if I can throw another server at it or install a prepackaged solution that will move things along a bit? At the same time, the Internet is filled with experts, trends and advice that are often misapplied, misinterpreted and adopted without good reason. People are always anxious to jump on the next bandwagon and use the latest buzzwords in hopes that what solved someone else’s problem will solve theirs.

Unfortunately, there is a lot to system optimization and none of these things can be described in full. In my experience these are the common culprits:

  • Poor development practices.
    • Database queries are inefficient.
    • Poor choice of algorithms & data structures.
    • Bloated frameworks.
    • Hit or miss caching: caching the wrong things but not caching the right things.
  • Bad Configuration
    • Apache / MySQL / PHP – left to defaults.
    • Wrong hardware choices/configuration for particular applications.
    • Using swap unnecessarily.
    • Using the same interface for public and local network traffic.
    • Network services doing stupid things (DNS lookups for local hosts, etc.)
    • The dreaded black box – commercial server appliances.
    • Overly complicated network topology.
  • Lack of common sense
    • Website is full of large images/media.
    • Storing images in a database.

In future posts we’ll look into the methods of discovering these bottlenecks and how to go about addressing them.

February 4th, 2010 By randy Categories: System Administration

Many modern file systems provide a feature called snapshots. As the name implies, snapshots allow you to make an “image” of a file system, with one beneficial feature: multiple “images” can be made over time without the space requirements of storing multiple copies of your data. This is accomplished by referencing data that hasn’t changed since the previous snapshot rather than storing it again.

Perhaps you’d like to give web developers access to files from 20 days ago simply by browsing to a directory such as /home/backup/20_days_ago/. From there you’d like them to be able to look at individual files, determine whether 20 days is sufficient and, if not, try /home/backup/21_days_ago. If the set of files you want to back up is 90GB and you’d prefer not to waste 90GB x 30 days (2.7TB) worth of disk space, snapshots are for you.

Both ZFS and UFS provide the ability to create snapshots at the file system level, and it’s relatively simple to do. If you don’t have the luxury of ZFS, or snapshots aren’t feasible on your current file system for some reason, there is another quick and dirty method: rsync.

The ZFS Method:

#!/usr/local/bin/bash
days=31
pool=home/sites
zfs snapshot $pool@DAILY-`date +%Y-%m-%d`
zfs list -t snapshot -o name | grep $pool@DAILY- | sort -r | tail -n +$days | xargs -n 1 zfs destroy

Chmod the above script executable and give it a place in your daily crontab. It will automatically purge snapshots that are older than the specified number of days. The above script can easily be modified to take in arguments or create snapshots in smaller increments.

Files can be found and browsed in the directory /home/sites/.zfs/snapshot/
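
Recovering a single file from a snapshot is then an ordinary copy; for example (the snapshot date and file name below are made up):

cp /home/sites/.zfs/snapshot/DAILY-2010-01-15/index.php /home/sites/index.php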

The rsync method:
Thanks to basic file system principles (hard links, in this case), this can also be accomplished without file system snapshots. Here is a quick way to store the last 31 days’ worth of “snapshots” of a directory on any ’nix file system. In this example we’ll create snapshots of /home/sites/ in /home/backup/{0..31}_days_ago.

#!/usr/local/bin/bash
source=/home/sites/
dest=/home/backup/
 
test -d $dest || mkdir $dest
rm -rf $dest"31_days_ago/"
 
for i in {31..1}
do
     prev=$(($i-1))
     mv $dest$prev"_days_ago" $dest$i"_days_ago"
done
/usr/local/bin/rsync -a --delete --link-dest=$dest"1_days_ago" $source $dest"0_days_ago/"

Don’t forget to make it executable: `chmod 755 above_script`

Cron entry:
0 23 * * * /location/of/above_script > /dev/null

Each time the script runs, it shifts each x_days_ago directory to x+1_days_ago and creates a fresh 0_days_ago as the most recent snapshot. Because rsync’s --link-dest option hard-links files that haven’t changed since 1_days_ago, each new snapshot only consumes space for the files that actually changed.
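
You can convince yourself the space savings are real by comparing inode numbers and total disk usage (the file name below is just an example):

ls -li /home/backup/0_days_ago/index.php /home/backup/1_days_ago/index.php   # same inode if unchanged
du -sh /home/backup/   # far less than 31 full copies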